(57) Abstract: A system for generating an audio message over a communications network (4) that is at least partly in a voice representative of a character generally recognisable to a user. Either a voice message or a text-based message may be used to construct the audio message. Specific recordings of well-known characters are stored in a storage means (14, 213), and background sound effects, stored in a database (14, 215), can be inserted into the audio message. The audio message is constructed by any one of the processing means (12, 212, 214) and transmitted to a recipient for playback on a processing terminal.
SPEECH SYSTEM

Field of the invention 

The invention relates to generating speech, and relates particularly but not exclusively to systems and methods of generating speech which involve the playback of messages in audio format, especially for entertainment purposes, such as in connection with digital communication systems and information systems.

Background of the invention

Computer software of increasing sophistication, and hardware of increasing power, has opened up possibilities for enhanced entertainment opportunities on digital platforms. These include, for example, the Internet accessed through devices such as personal computers or gaming consoles, digital television and radio applications, digital telephony, etc.

In particular, there has been a significant growth in the complexity of computer games, as well as increased use of email systems, chat rooms (such as ICQ and others), other instant messaging services (such as SMS) and multi-user domains. In most cases, these types of applications are text-based or at least rely heavily on the use of text. However, to date, these applications have not made significant use of text-to-voice technology to enhance a user's experience of these types of applications, despite the widespread availability of these technologies.

In applications where computer-generated voices have been used, the technology has been used primarily as a carrier for unprocessed voice signals. For example, Internet-based chat rooms (for example, NetMeeting) exist whereby two or more users can communicate in their own voices instead of via typed messages. In applications where text-to-speech technology has been used (for example, email reading programs), the entertainment value of the voice has been low due to the provision of usually only one voice, or a small number of generic voices (for example, US English male).

Talking toys have a certain entertainment value, but existing toys are usually restricted to a fixed sequence or a random selection of pre-recorded messages. In some toys, the sequence of available messages can be determined by a selection from a set of supplied messages. In other cases, the user has the opportunity of making a recording of their own voice, such as with a conventional cassette recorder or karaoke machine, for use with the toy.

Users of such talking toys can quickly tire of their toy's novelty value, as the existing messages and their various combinations hold limited entertainment possibilities: only moderate amusement options are available to the user.

It is an object of the invention to at least attempt to address these and other limitations of the prior art. More particularly, it is an object of the invention to address these and other deficiencies in connection with the amusement value associated with text and audio messages, especially messages generated or processed by digital communications or information systems.

It is an object of the invention to address these and other deficiencies in connection with the amusement value associated with audio messages for entertainment purposes in connection with talking toys.

Summary of the invention

The inventive concept resides in a recognition that text can desirably be converted into a voice representative of a particular character, such as a well-known entertainment personality or fictional character. This concept has various inventive applications in a variety of contexts, including use in connection with, for example, text-based messages. As an example, text-based communications such as email, or chat-based systems such as IRC or ICQ, can be enhanced in accordance with the inventive concept by using software applications or functionality that allows for playback of text-based messages in the voice of a particular character. As a further example, it is possible to provide, in accordance with the inventive concept, a physical toy which can be configured by a user to play one or more voice messages in the voice of a character or personality represented by the stylistic design of the toy (for example, Elvis Presley or Homer Simpson). In either case, the text-based message can be constructed by the user by typing or otherwise constructing the text message representative of the desired audio message.
According to a first aspect of the invention there is provided a method of generating an audio message, including:

providing a text-based message; and
generating said audio message based on said text-based message;

wherein said audio message is at least partly in a voice which is representative of a character generally recognisable to a user.

According to a second aspect of the invention there is provided a system for generating an audio message comprising:

means for providing a text-based message;

means for generating said audio message based on said text-based message;
wherein said audio message is at least partly in a voice which is representative of a character generally recognisable to a user.

According to a third aspect of the invention there is provided a system for generating an audio message using a communications network, said system comprising:

means for providing a text-based message linked to said communications network;

means for generating said audio message based on said text-based message;
wherein said audio message is at least partly in a voice which is representative of a character generally recognisable to a user.

Preferably, the character in whose voice the audio message is generated is selected from a predefined list of characters which are generally recognisable to a user.

Preferably, the audio message is generated based on the text-based message using a textual database which indexes speech units (words, phrases and sub-word phrases) with corresponding audio recordings representing those speech units. Preferably, the audio message is generated by concatenating together one or more audio recordings of speech units, the sequence of the concatenated audio recordings being determined with reference to indexed speech units associated with one or more of the audio recordings in the sequence.
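The unit selection and concatenation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the database layout (a dictionary keyed by tuples of lowercase words), the greedy longest-match rule for preferring phrases over single words, and the names `select_units` and `concatenate` are all hypothetical, and short integer lists stand in for real audio recordings.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical textual database: each indexed speech unit (a word or phrase, as a
# tuple of lowercase words) maps to its recording. Integer lists stand in for audio.
SpeechDB = Dict[Tuple[str, ...], List[int]]


def select_units(words: List[str], db: SpeechDB, max_len: int = 4) -> List[Tuple[str, ...]]:
    """Greedy longest match: prefer the longest indexed phrase starting at each word."""
    units: List[Tuple[str, ...]] = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if candidate in db:
                units.append(candidate)
                i += n
                break
        else:
            # No phrase or word recording covers this word; substitution would be needed.
            raise KeyError(f"no recording for {words[i]!r}")
    return units


def concatenate(text: str, db: SpeechDB) -> List[int]:
    """Join the recordings of the selected units in the order they occur in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    samples: List[int] = []
    for unit in select_units(words, db):
        samples.extend(db[unit])
    return samples


db: SpeechDB = {
    ("hello",): [1, 2],
    ("thank", "you", "very", "much"): [3, 4, 5],
    ("thank",): [9],
    ("you",): [8],
}
print(concatenate("Hello, thank you very much!", db))  # [1, 2, 3, 4, 5]
```

Preferring the longest indexed phrase keeps natural-sounding recorded phrases intact and reduces the number of joins between recordings; a fuller system might score alternative unit sequences, which this sketch does not attempt.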
Preferably, words in a text-based message which do not have corresponding audio recordings of suitable speech units are substituted with substitute words which do have corresponding audio recordings. Preferably, the substituted word has a closely similar grammatical meaning to the original word, in the context of the text-based message.

Preferably, a thesaurus which indexes a large number of words with alternative words is used to achieve this substitution. Preferably, the original word is substituted with a replacement supported word which has suitably associated audio recordings. Preferably, the thesaurus can be iteratively searched for alternative words to eventually find a supported word having suitably associated audio recordings. Preferably, use of the thesaurus may be extended to include grammatical-based processing of text-based messages, or dictionary-based processing of text-based messages. Alternatively, unsupported words can be synthesised by reproducing a sequence of audio recordings of suitable atomic speech elements (for example, diphones) and applying signal processing to this sequence to enhance its naturalness.
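The iterative thesaurus search can be illustrated with a breadth-first search, which returns a supported synonym reachable in the fewest thesaurus hops. The thesaurus format, the hop limit `max_depth` and the function names are assumptions for illustration; words with no supported synonym are returned unchanged so that they can instead be synthesised from atomic speech elements such as diphones.

```python
from collections import deque
from typing import Dict, List, Optional, Set

Thesaurus = Dict[str, List[str]]  # hypothetical format: word -> alternative words


def find_supported(word: str, thesaurus: Thesaurus, supported: Set[str],
                   max_depth: int = 3) -> Optional[str]:
    """Breadth-first search of the thesaurus for the nearest supported word."""
    if word in supported:
        return word
    seen = {word}
    queue = deque([(word, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not look further than max_depth synonym hops
        for alt in thesaurus.get(current, []):
            if alt in seen:
                continue
            if alt in supported:
                return alt
            seen.add(alt)
            queue.append((alt, depth + 1))
    return None


def substitute(words: List[str], thesaurus: Thesaurus, supported: Set[str]) -> List[str]:
    """Replace unsupported words; words with no supported synonym are kept so that
    they can be synthesised from smaller units instead."""
    return [find_supported(w, thesaurus, supported) or w for w in words]


thesaurus: Thesaurus = {"gigantic": ["huge", "enormous"], "huge": ["big"], "enormous": ["vast"]}
supported = {"a", "big", "dog"}
print(substitute(["a", "gigantic", "dog"], thesaurus, supported))  # ['a', 'big', 'dog']
```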

Preferably, the supported words having associated suitable audio recordings are a collection of commonly used words in a particular language that are generally adequate for general communication. Preferably, the textual database further indexes syllables and phrases. Preferably, the phrases are phrases which are commonly used in the target language, or are phrases characteristic of the character. In some cases, it is desirable that the phrases include phrases that are purposefully or intentionally out of character.

Preferably, the generation of audio messages optionally involves a preliminary step of converting the provided text-based message into a corresponding text-based message which is instead used as the basis for generating the audio message.

Preferably, conversion from an original text-based message to a corresponding text-based message substitutes the original text-based message with a corresponding text-based message which is an idiomatic representation of the original text-based message.

Preferably, in some embodiments, the corresponding text-based message is in an idiom which is attributable to, associated with, or at least compatible with the character.



WO 01/57851 PCT/AU01/00111

Preferably, in other embodiments, the corresponding text-based message is in an idiom which is intentionally incompatible with the character, or is attributable to, or associated with, a different character which is generally recognisable by a user.

Preferably, if the text-based message involves a narrative in which multiple narrative characters appear, the audio message can be generated in respective multiple voices, each representative of a different character which is generally recognisable to a user.

Preferably, only certain words or word strings in an original text-based message are converted to a corresponding text-based message which is an idiomatic representation of the original text-based message.
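Substitution of selected words or word strings with a character's idiom might be sketched as a table-driven replacement. The idiom table below is invented for illustration and does not represent any real character's idiom; the regular-expression approach, which matches whole words case-insensitively and tries longer entries first, is one assumed way of restricting conversion to the listed strings.

```python
import re
from typing import Dict

# Hypothetical idiom table for one character (keys must be lowercase); entries are illustrative.
IDIOMS: Dict[str, str] = {
    "thank you very much": "thankyouverymuch",
    "hello": "well hello there",
    "friend": "little darlin'",
}


def to_idiom(text: str, idioms: Dict[str, str]) -> str:
    """Replace only the listed words or word strings, leaving the rest of the text intact."""
    if not idioms:
        return text
    # Try longer entries first so a multi-word string wins over any word inside it.
    phrases = sorted(idioms, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, phrases)) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: idioms[m.group(0).lower()], text)


print(to_idiom("Hello friend, thank you very much.", IDIOMS))
# well hello there little darlin', thankyouverymuch.
```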

Preferably, there can be provided conversion from an original text-based message to a corresponding text-based message which involves a translation between two established human languages, such as French and English. Of course, translation may involve either a source or a target language which is a constructed or devised language attributable to, associated with, or at least compatible with the character (for example, the Pokemon language). Translation between languages may be an alternative or an addition to substitution into an idiom of the character.

Preferably, the text-based message is provided by a user. Preferably, the text is entered by the user as a sequence of codes using, for example, an alphanumeric keyboard.

Preferably, the user-provided text-based message can include words or other text-based elements which are selected from a predetermined list of particular text-based elements. This list of text-based elements includes, for example, words as well as common phrases or expressions. One or more of these words, phrases or expressions may be specific to a particular character. The text-based elements can include vocal expressions that are attributable to, associated with, or at least compatible with the character.

Preferably, text-based elements are represented in a text-based message with specific codes representative of the respective text-based element. Preferably, this is achieved using a preliminary escape code sequence followed by the appropriate code for the text-based element. Text-based elements can be inserted by users, or inserted automatically to punctuate, for example, sentences in a text-based message. Alternatively, generation of an audio message can include the random insertion of particular vocal expressions between certain predetermined audio recordings from which the audio message is composed.
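One possible reading of the escape-code scheme is sketched below. The escape sequence `~~`, the two-digit codes and the element names are all hypothetical, since no concrete encoding is specified here. The second function illustrates the alternative of randomly inserting vocal expressions, using a seeded `random.Random` so that results can be reproduced.

```python
import random
import re
from typing import Dict, List, Tuple

ESCAPE = "~~"  # hypothetical escape sequence, followed by a two-digit element code
ELEMENTS: Dict[str, str] = {"01": "laugh", "02": "sigh", "03": "catchphrase"}  # hypothetical

Token = Tuple[str, str]  # ("text", words) or ("element", element name)


def parse_codes(message: str, elements: Dict[str, str]) -> List[Token]:
    """Split a message into plain text and coded text-based elements."""
    tokens: List[Token] = []
    pos = 0
    for m in re.finditer(re.escape(ESCAPE) + r"(\d{2})", message):
        text = message[pos:m.start()].strip()
        if text:
            tokens.append(("text", text))
        code = m.group(1)
        if code not in elements:
            raise ValueError(f"unknown element code {code}")
        tokens.append(("element", elements[code]))
        pos = m.end()
    tail = message[pos:].strip()
    if tail:
        tokens.append(("text", tail))
    return tokens


def insert_random_expressions(tokens: List[Token], expressions: List[str],
                              rate: float, rng: random.Random) -> List[Token]:
    """Alternative: randomly insert a vocal expression after text segments
    (never after the final token)."""
    out: List[Token] = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if tok[0] == "text" and i < len(tokens) - 1 and rng.random() < rate:
            out.append(("element", rng.choice(expressions)))
    return out


print(parse_codes("Hi there ~~01 how are you ~~02", ELEMENTS))
# [('text', 'Hi there'), ('element', 'laugh'), ('text', 'how are you'), ('element', 'sigh')]
```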

Preferably, this coded sequence can also be used to express emotions, mark changes in the character identification, and insert background sounds and canned expressions in the text-based message. Preferably, this coded sequence is based on HTML or XML.
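Because the coded sequence may be based on XML, a markup-driven variant can be parsed with Python's standard `xml.etree.ElementTree`. The `<message>` and `<voice>` tag names, the `name`, `emotion` and `background` attributes, and the voice identifier `elvis` are an invented schema used only for this sketch.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# (voice, emotion, text, background sound or None): one entry per rendered segment
Segment = Tuple[str, str, str, Optional[str]]


def parse_markup(xml_text: str, default_voice: str = "narrator") -> List[Segment]:
    """Parse a hypothetical <message>/<voice> markup into renderable segments.
    Only one level of <voice> tags is handled; nested tags are rejected."""
    root = ET.fromstring(xml_text)
    segments: List[Segment] = []

    def add(voice: str, emotion: str, text: Optional[str],
            background: Optional[str] = None) -> None:
        text = (text or "").strip()
        if text:
            segments.append((voice, emotion, text, background))

    add(default_voice, "neutral", root.text)  # untagged text before the first tag
    for el in root:
        if el.tag != "voice" or len(el):
            raise ValueError(f"unsupported markup at <{el.tag}>")
        add(el.get("name", default_voice), el.get("emotion", "neutral"),
            el.text, el.get("background"))
        add(default_voice, "neutral", el.tail)  # untagged text after this tag
    return segments


msg = ('<message>Hello. <voice name="elvis" emotion="happy" background="applause">'
       'Thank you</voice> Goodbye.</message>')
print(parse_markup(msg))
# [('narrator', 'neutral', 'Hello.', None), ('elvis', 'happy', 'Thank you', 'applause'), ('narrator', 'neutral', 'Goodbye.', None)]
```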

Preferably, the textual database omits certain words which are not considered suitable, so that the generated audio messages can be censored to a certain extent.

Preferably, the text-based message can be generated from an audio message by using voice recognition technology, and subsequently used as the basis for the generation of an audio message in a voice representative of a generally recognisable character.

Preferably, a user can apply one or more audio effects to the audio message. These effects can be used, for example, to change the sound characteristics of the audio message so that it sounds as if the character is underwater, or has a cold, etc. Optionally, the characteristics of the speech signal (for example, the F0 signal, or phonetic and prosodic models) may be deliberately modified or replaced to substantially modify the characteristics of the voice. An example may be a lawn mower speaking in a voice recognisable as Elvis Presley's. Preferably, the text-based message is represented in a form able to be used by digital computers, such as ASCII (American Standard Code for Information Interchange).
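As an illustration of an audio effect, a muffled, "underwater" quality can be approximated by low-pass filtering the generated speech. The one-pole filter and the 500 Hz default cutoff below are assumptions chosen for simplicity, not a method stated in this specification; samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math
from typing import List


def muffle(samples: List[float], sample_rate: int, cutoff_hz: float = 500.0) -> List[float]:
    """One-pole low-pass filter: attenuates high frequencies so speech sounds muffled."""
    dt = 1.0 / sample_rate
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)  # smoothing factor between 0 and 1
    out: List[float] = []
    y = 0.0
    for x in samples:
        y += alpha * (x - y)  # move part of the way towards the new input sample
        out.append(y)
    return out


# A 4 kHz tone sampled at 16 kHz is strongly attenuated by a 500 Hz cutoff.
sr = 16000
tone = [math.sin(2 * math.pi * 4000 * n / sr) for n in range(1600)]
quiet = muffle(tone, sr)
print(max(abs(s) for s in tone), max(abs(s) for s in quiet[-400:]))
```

Effects such as a "cold" or a changed F0 would need different processing (for example, pitch modification), which this sketch does not cover.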

Preferably, the inventive methods described above are performed using a computing device having installed therein a suitable operating system able to execute software capable of effecting these methods. Preferably, the methods are performed using a user's local computing device, or using a computing device with which a user can remotely communicate through a network. Preferably, a number of users provide text-based messages to a central computing device connected to the Internet and accessible using a World Wide Web (WWW) site, and receive an audio message via the Internet. The audio message can be received either as a file in a standard audio file format which is, for example, transferred across the Internet using the FTP or HTTP protocols, or as an attachment to an email message. Alternatively, the audio message may be provided as a streaming audio broadcast to one or more users.
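Before transfer over HTTP or FTP, or attachment to an email, the generated samples must be packaged in a standard audio file format. The sketch below assumes mono 16-bit PCM WAV as that format and uses only Python's standard `wave`, `struct` and `io` modules; the function name `to_wav_bytes` is hypothetical, and the network transfer itself is not shown.

```python
import io
import struct
import wave
from typing import List


def to_wav_bytes(samples: List[float], sample_rate: int = 16000) -> bytes:
    """Package mono float samples in [-1, 1] as a 16-bit PCM WAV file held in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        clipped = (max(-1.0, min(1.0, s)) for s in samples)  # avoid integer overflow
        ints = [int(round(s * 32767)) for s in clipped]
        wav.writeframes(struct.pack("<%dh" % len(ints), *ints))  # little-endian int16
    return buf.getvalue()


data = to_wav_bytes([0.0, 0.5, -0.5, 1.0], sample_rate=8000)
with wave.open(io.BytesIO(data), "rb") as wav:
    print(wav.getnframes(), wav.getframerate())  # 4 8000
```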






In embodiments in which an audio message is generated by means of a computing device, the option is preferably provided to generate an accompanying animated image which corresponds with the audio message. Preferably, this option is available where an audio message is generated by a user's local computing device. Preferably, the audio message and the animation are provided in a single audio/visual computer-interpretable file format, such as Microsoft AVI format or Apple QuickTime format. Preferably, the animation is a visual representation of the character which "speaks" the audio message, and the character moves in accordance with the audio message. For example, the animated character preferably moves its mouth and/or other facial or bodily features in response to the audio message. Preferably, movement of the animated character is synchronised with predetermined audio or speech events in the audio message. These might include, for example, the start and end of words, the use of certain key phrases, or signature sounds.
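Synchronising the animation with word boundaries can be sketched by accumulating the durations of the concatenated recordings. In this illustration the durations are given directly (in practice they could be computed as the number of samples divided by the sample rate), and the simple open/closed mouth keyframe model and the 24 frames-per-second rate are assumptions.

```python
from typing import List, Tuple

Event = Tuple[float, str, str]  # (time in seconds, event type, word)


def speech_events(units: List[Tuple[str, float]]) -> List[Event]:
    """Start and end times of each word when recordings are played back to back."""
    events: List[Event] = []
    t = 0.0
    for word, duration in units:
        events.append((t, "word_start", word))
        t += duration
        events.append((t, "word_end", word))
    return events


def mouth_keyframes(events: List[Event], fps: int = 24) -> List[Tuple[int, bool]]:
    """Map each event to (frame number, mouth open?) for the animated character."""
    return [(round(time * fps), kind == "word_start") for time, kind, _ in events]


events = speech_events([("hello", 0.5), ("there", 0.25)])
print(mouth_keyframes(events))  # [(0, True), (12, False), (12, True), (18, False)]
```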


Embodiments of the invention are preferably facilitated using a network which allows for communication of text-based messages and/or audio messages between users. Preferably, a network server can be used to distribute one or more audio messages generated in accordance with embodiments of the invention.

Preferably, the inventive methods are used in conjunction with text-based communications or messaging systems such as email (electronic mail) or electronic greeting cards, or chat-based systems such as IRC (Internet Relay Chat) or ICQ (or other IP-to-IP messaging systems). In these cases, the text-based message is provided by, or at least derived from, the text of the email message, electronic greeting card or chat line.

Preferably, when said inventive methods are used in conjunction with email or similar asynchronous messaging systems, audio messages may be embedded wholly within the transmitted message. Alternatively, a hyperlink or other suitable reference to the audio message may be provided within the email message. Regardless of whether the audio message is provided in full or by reference, the audio message may be played immediately or stored on a storage medium for later replay. Audio messages may be broadcast to multiple recipients, or forwarded between recipients as required. Messages may be automatically transmitted to certain recipients based on predetermined rules, for example, a birthday message on the recipient's birthday. In other embodiments, transmission of an audio message may be replaced by transmission of a text message which is converted to an audio message at the recipient's computing terminal. The voice in which the transmitted text message is to be read is preferably able to be specified by the sender. Preferably, transmissions of the above kind are presented as a digital greeting message.

Preferably, when said inventive methods are used in conjunction with chat rooms or similar synchronous messaging systems, incoming and/or outgoing messages are converted to audio messages in the voice of a particular character. Messages exchanged in chat rooms can be converted directly from text provided by users, which may optionally be derived through speech recognition means processing the speaking voices of chat room users. Preferably, each chat room user is able to specify, at least at a default level, the particular character's voice in which their messages are provided. In some embodiments, it is desirable that each user is able to assign particular characters' voices to other chat room users. In other embodiments, particular chat room users may be automatically assigned particular characters' voices. In this case, particular chat rooms would be notionally populated by characters having a particular theme (for example, a chat room populated by famous American political figures).
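The three ways of choosing a voice described above (a sender's default, a listener's own assignment, and automatic assignment in a themed room) can be combined in a small resolver. The class name, the voice identifiers, the precedence order (listener assignment, then sender default, then themed assignment, then a fallback voice) and the round-robin allocation are all assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple


class ChatVoices:
    """Resolve which character voice a listener hears for each chat room sender."""

    def __init__(self, theme_voices: Optional[List[str]] = None, fallback: str = "narrator"):
        self.defaults: Dict[str, str] = {}               # sender -> voice chosen by the sender
        self.overrides: Dict[Tuple[str, str], str] = {}  # (listener, sender) -> listener's choice
        self.auto: Dict[str, str] = {}                   # sender -> voice assigned by a themed room
        self.theme_voices = list(theme_voices or [])
        self.fallback = fallback

    def set_default(self, sender: str, voice: str) -> None:
        self.defaults[sender] = voice

    def assign(self, listener: str, sender: str, voice: str) -> None:
        self.overrides[(listener, sender)] = voice

    def voice_for(self, listener: str, sender: str) -> str:
        if (listener, sender) in self.overrides:
            return self.overrides[(listener, sender)]
        if sender in self.defaults:
            return self.defaults[sender]
        if self.theme_voices:
            if sender not in self.auto:
                # Round-robin through the room's themed characters in order of first appearance.
                self.auto[sender] = self.theme_voices[len(self.auto) % len(self.theme_voices)]
            return self.auto[sender]
        return self.fallback


room = ChatVoices(theme_voices=["politician_a", "politician_b"])
room.set_default("alice", "cartoon_dad")
room.assign("bob", "alice", "rock_star")
print(room.voice_for("bob", "alice"), room.voice_for("carol", "alice"), room.voice_for("bob", "dave"))
# rock_star cartoon_dad politician_a
```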


Preferably, the inventive methods are used in conjunction with graphical user interfaces such as those provided by computer operating systems, or particular applications such as the World Wide Web. Preferably, certain embodiments provide a navigation agent which uses text-based messages spoken in the voice of a recognisable character to assist the user in navigating the graphical user interface.

Preferably, the methods can also be extended for use with other messaging systems, such as voice mail. This may involve, for example, generation of a text representation of a voice message left on a voice mail service. This representation can be used to provide or derive a text-based message on which a generated audio message can be based.

Preferably, the methods can be applied in the context of recording a greeting message provided on an answering machine or service. A user can have a computing device configure the answering machine or service, either directly or through a telephone network, to use an audio message generated in accordance with the inventive method.

Preferably, a central computing device on the Internet can be accessed by users to communicate through the telephone network with the answering machine or service, so that the answering machine or service stores a record of a generated audio message. This audio message may be based on a text-based message provided to the central computing device by the user, or deduced through speech recognition of the existing greeting message used by the answering machine or service.

Preferably, both the language in which the text message is entered and the language of the spoken voices are a variation of standard English, such as Americanised English.

Preferably, the prosody and accent (pitch and speaking speed) of the message and, optionally, the selection of character are dependent upon such factors as the experience level of the user, the native accent of the user, the need (or otherwise) for speedy response, how busy the network is, and the location of the user.

Preferably, "voice fonts" for recognisable characters can be developed by recording that character's voice for use in a text-to-speech system, using suitable techniques and equipment.

Preferably, many users can interact with systems provided in accordance with embodiments. Preferably, a database of messages is provided that allows a user to recall or resend recent text-to-speech messages.

Preferably, the inventive methods are used to supply a regularly updated database of audio-based jokes, wise-cracks, stories, advertisements and song extracts in the voice of a known character, based on conversion from a mostly textual version of the joke, wise-crack, story, advertisement or song extract to audio format. Preferably, said jokes, wise-cracks, stories, advertisements and song extracts are delivered to one or more users by means of a computer network such as the Internet.

Preferably, prosody can be deduced from the grammatical structure of the text-based message. Alternatively, prosody can be trained by analysing an audio waveform of the user's own voice as he or she reads the entered text, with all of the inflection, speed and emotion cues built into that recording; this prosodic model is then used to guide the text-to-speech conversion process. Alternatively, prosody may be trained by extracting this information from the user's own voice in a speech-to-speech system. In each of these prosody generation methods, prosody may be enhanced by including emotional markups or cues in the text-based message. Preferably, the corpus (the textual script of the recordings that make up the recorded speech database) may be marked up (for example, with escape codes, HTML, SABLE, XML etc.) to include descriptions of the emotional expression used during the recording.
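As a crude stand-in for deducing prosody from grammatical structure, the sketch below uses sentence punctuation alone to assign a relative pitch (1.0 being the speaker's baseline F0) to each word. The declination range, the final rise for questions and the raised register for exclamations are arbitrary assumed values; genuine grammatical analysis would consider far more than punctuation.

```python
import re
from typing import List, Tuple


def punctuation_prosody(text: str) -> List[Tuple[str, float]]:
    """Assign an assumed relative pitch to each word based on sentence punctuation."""
    result: List[Tuple[str, float]] = []
    for sentence, mark in re.findall(r"([^.?!]+)([.?!]?)", text):
        words = sentence.split()
        n = len(words)
        for i, word in enumerate(words):
            # Gentle declination across the sentence: 1.1 at the start down to 0.9 at the end.
            pitch = 1.1 - 0.2 * (i / (n - 1)) if n > 1 else 1.0
            if mark == "?" and i == n - 1:
                pitch = 1.3  # final rise for questions
            elif mark == "!":
                pitch *= 1.15  # raised register for exclamations
            result.append((word, round(pitch, 3)))
    return result


print(punctuation_prosody("Is it raining? It is fine."))
# [('Is', 1.1), ('it', 1.0), ('raining', 1.3), ('It', 1.1), ('is', 1.0), ('fine', 0.9)]
```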

Preferably, an audio format file generated by a character-voice TTS system can be protected from multiple or unauthorised use by encryption or with time delay technology, preferably by the use of an encoder and decoder program.

Preferably, the inventive methods can be used to narrate a story on the user's computer or toy. The character voices that play any or each of the characters and/or the narrator of the story can preferably be altered by the user. Each segment of the story may be constructed from sound segments of recorded words, phrases and sentences of the desired characters, or optionally partially or wholly constructed using the character TTS system.

Preferably, the inventive methods can be used to provide navigational aids for media systems such as the Web. Preferably, Web sites can include the use of a famous character's voice to assist a user in navigating a site. A character's voice can also be used to present information otherwise included in the site, or to provide a commentary complementary to the information provided by the Web site. The character's voice may also function as an interactive agent to whom the user may present queries. In other embodiments, the Web site may present a dialogue between different characters to enhance the user's experience. The dialogue may be automatically generated, or dictated by feedback provided by the user.

Preferably, telephony-based navigation systems, such as Interactive Voice Response (IVR) systems, can provide recognisable voices based on text provided to the system. Similarly, narrowband navigation systems such as those provided by the Wireless Application Protocol (WAP) can alternatively present information to a user of such a system in recognisable voices instead of text.
Preferably, embodiments can be used in conjunction with digital broadcast systems such as, for example, digital radio and digital television, to convert broadcast text messages to audio messages read in the voice of a recognisable character.

Preferably, embodiments may be used in conjunction with simulated or virtual worlds so that, for example, text messages are spoken in a recognisable voice by avatars or other represented entities within such environments. Preferably, avatars in such environments have a visual representation which corresponds with that of the recognisable character in whose voice text messages are rendered in the environment.

Preferably, text messages used in relation to embodiments of the invention may be marked using tags or other similar notation in a markup language to facilitate conversion of the text message into a famous character's voice. Such a defined language may provide the ability to choose between the voices of different famous characters, and between different emotions in which the text is to be reproduced in audio form. Character-specific features may be used to specify more precisely how a particular text message is rendered in audio form. Preferably, automated tools are provided in computing environments to provide these functions.

Preferably, embodiments of the invention can be used to provide audio messages that are synchronised with visual images of the character in whose voice the audio message is provided. In this respect, a digital representation of the character may be provided, whose facial expressions reflect the sequence of words, expressions and other aural elements "spoken" by that character.

Preferably, embodiments may be used to provide a personalised message to a user by way of reference, for example, to a Web site. Preferably, the personalised message is provided to the user in the context of providing a gift to that user. Preferably, the message relates to a greeting made from one person to another, and is rendered in a famous character's voice. The greeting message may represent a dialogue between different famous characters which refers to a specific type of greeting occasion such as, for example, a birthday.
In the described embodiments of the invention, the use of one voice is generally described. However, embodiments are in general equally suited to the use of multiple voices of different respective recognisable characters.

Preferably, embodiments can be used in a wide variety of applications and contexts other than those specifically referred to above. For example, virtual news readers, audio comic strips, multimedia presentations, graphical user interface prompts, etc. can incorporate text-to-speech functionality in the voice of a recognisable character.

Preferably, the above methods can be used in conjunction with a toy which can be connected with a computing device, either directly or through a network. Preferably, when a toy is used in conjunction with a computing device, the toy and the computing device can share, as appropriate, the functionality required to achieve the inventive methods described above.

Accordingly, the invention further includes coded instructions interpretable by a computing device for performing the inventive methods described above. The invention also includes a computer program product provided on a medium, the medium recording coded instructions interpretable by a computing device which is adapted to consequently perform the inventive methods described above. The invention further includes distributing, or providing for distribution through a network, coded instructions interpretable by a computing device for performing, in accordance with the instructions, the inventive methods described above. The invention also includes a computing device performing or adapted to perform the inventive methods described above.

According to a fourth aspect of the invention there is provided a toy comprising:
speaker means for playback of an audio signal;
memory means to store a text-based message; and

controller means operatively connecting said memory means and said speaker means for generating an audio signal for playback by said speaker means;

wherein said controller means, in use, generates an audio message which is at least partly in a voice representative of a character generally recognisable to a user.

According to a fifth aspect of the present invention there is provided a toy comprising:
speaker means for playback of an audio signal;
memory means to store an audio message; and

controller means operatively connecting said memory means and said speaker means for generating said audio signal for playback by said speaker means;

wherein said controller means, in use, generates said audio message which is at least partly in a voice representative of a character generally recognisable to a user.

Preferably, the toy is adapted to perform, as applicable, one or more of the preferred methods described above.

Preferably, the controller means is operatively connected with a connection means which allows the toy to communicate with a computing device. Preferably, the computing device is a computer which is connected with the toy by a cable via the connection means. Alternatively, the connection means may be adapted to provide a wireless connection, either directly to a computer or through a network such as the Internet.

Preferably, the connection means allows text-based messages (such as email) or recorded audio messages to be provided to the toy for playback through the speaker means. Alternatively, the connection means allows an audio signal to be provided directly to the speaker means for playback of an audio message.

Preferably, the toy has the form of the character. Preferably, the toy is adapted to move its mouth and/or other facial or bodily features in response to the audio message. Preferably, movement of the toy is synchronised with predetermined speech events of the audio message. These might include, for example, the start and end of words, the use of certain key phrases, or signature sounds.

Preferably, the toy is an electronic hand-held toy having a microprocessor-based controller means and a non-volatile memory means. Preferably, the toy includes functionality to allow for recording and playback of audio. Preferably, audio recorded by the toy can be converted to a text-based message which is then used to generate an audio message, spoken in the voice of a generally recognisable character. Preferred features of the inventive method described above apply analogously, where appropriate, to the inventive toy.

Alternatively, when the toy includes a connection means, an audio message can be provided directly to the toy using the connection means for playback of the audio message through the speaker means. In this case, the text-based message can be converted to an audio message by a computing device with which the toy is connected, either directly or through a network such as the Internet. The audio message provided to the toy is stored in the memory means and reproduced by the speaker means. The advantage of this configuration is that it requires less processing power of the controller means and less storage capacity of the memory means of the toy. It also provides greater flexibility in how the text-based message can be converted to an audio message: for example, if the text-to-audio processing is performed on a central computing device connected to the Internet, software executing on the central computing device can be modified as required to provide enhanced text-to-audio functionality.

According to a sixth aspect of the invention there is provided a system for generating an
audio message which is at least partly in a voice representative of a character
generally recognisable to a user, said system comprising:
means for transmitting a message request over a communications network;

message processing means for receiving said message request;
wherein said processing means processes said message request,
constructs said audio message that is at least partly in a voice representative of a
character generally recognisable to a user, and forwards the constructed audio
message over said communications network to one or more recipients.



According to a seventh aspect of the present invention there is provided a method for
generating an audio message which is at least partly in a voice representative of a
character generally recognisable to a user, said method comprising the following
steps:

transmitting a message request over a communications network;

processing said message request and constructing said audio message in at
least partly a voice representative of a character generally recognisable to a user;
and

forwarding the constructed audio message over said communications
network to one or more recipients.

According to an eighth aspect of the invention there is provided a method of generating
an audio message, comprising the steps of:

providing a request to generate said audio message in a predetermined format;
generating said audio message based on said request;

wherein said audio message is at least partly in a voice which is representative of a
character generally recognisable to a user.

Brief Description of the Drawings

Figure 1 is a schematic block diagram showing a system used to construct and deliver an
audio message according to a first embodiment;

Figure 2 is a flow diagram showing the steps involved in converting text or speech input
by a sender in a first language into a second language;

Figure 3 is a schematic block diagram of a system used to construct and deliver an audio 
message according to a further embodiment; 

Figure 4 shows examples of text appearing on screens of a processing terminal used by a
sender;

Figure 5 is a flow diagram showing the general process steps used by the present
invention;



Figure 6 is an example of a template used by a sender in order to construct an audio
message in the voice of a famous person;

Figure 7 is a schematic diagram showing examples of drop-down menus used to construct
an audio message;

Figure 8 is a flow diagram showing the processes involved when a word or phrase is not
to be spoken by a selected famous character;

Figure 9 is a flow diagram showing process steps used in accordance with a natural
language conversation system;

Figure 10 is a flow diagram showing process steps used by a user to construct a message 
using a speech interface; 

Figure 11 is a schematic diagram of a web page accessed by a user wishing to construct a
message to be received by a recipient;

Figure 12 is a schematic diagram showing a toy connectable to a computing processing
means that may store and play back messages recorded in a voice of a famous character.



Detailed Description of preferred embodiments

Various embodiments are described below in detail. The system by which text is
converted to speech is referred to as the TTS system. In certain embodiments, the user can
enter text or retrieve text which represents the written language statements of the audible
words or language constructs that the user desires to be spoken. The TTS system
processes this text-based message and performs a conversion operation upon the message
to generate an audio message. The audio message is in the voice of a character that is
recognisable to most users, such as a popular cartoon character (for example, Homer
Simpson) or real-life personality (for example, Elvis Presley). Alternatively,
"stereotypical" characters may be used, such as a "rap artist" (e.g. Puffy), whereby the
message is in a voice typical of how a rap artist speaks. Or the voice could be a "granny"
(for grandmother), "spaced" (for a spaced-out drugged person) or "sexy" voice. Many
other stereotypical character voices can be used.

The text to audio conversion operation converts the text message to an audio format
message representing the message, spoken in one of several well known character voices
(for example, Elvis Presley or Daffy Duck) or an impersonation of the character's voice.
In embodiments that are implemented in software, the chosen character is selected from a
database of supported characters, either automatically or by the user. The conversion
process of generating an audio message is described in greater detail below under the
heading "TTS system". In the toy embodiment, the voice is desirably compatible with the
visual design of the toy and/or the toy's accessories such as clip-on components. The user
can connect the toy to a compatible computer using the connection means of the toy. The
software preferably downloads the audio format message to the user's compatible
computer which in turn transfers the audio format message to non-volatile memory on the
toy via the connection means. The user can unplug the toy from the compatible computer.
The user then operates the controlling means on the toy to play and replay the audio
format message.

Software can download the audio format message to the user's compatible computer via
the Internet and the connected modem. The audio format message is in a standard
computer audio format (for example, Microsoft's WAV or Sun's AU formats),
and the message can be replayed through the compatible computer's speakers using a
suitable audio replay software package (for example, Microsoft Sound Recorder).

TTS system 

In the preferred embodiments, a hybrid TTS system is used to perform conversion of a
text-based message to an audio format message. A hybrid TTS system (for example,
Festival) combines the best features of limited domain slot and filler TTS systems, unit
selection TTS systems and synthesised TTS systems. Limited domain slot and filler TTS
systems give excellent voice quality in limited domains; unit selection TTS systems give
very good voice quality in broad domains, but require large sets of recorded voice data.
Synthesised TTS systems provide very broad to unlimited text domain coverage from a
small set of recorded speech elements (for example, diphones), but suffer from
lower voice quality. A unit selection TTS system is an enhanced form of concatenative
TTS system, whereby the system can select large (or small) sections of recorded speech
that best match the desired phonetic and prosodic structure of the text.
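
The unit selection idea can be illustrated with a generic dynamic-programming (Viterbi-style) search. This is a minimal sketch, not the algorithm used by Festival or by the described embodiments: it assumes each target position has a few candidate recordings, uses a hypothetical `Unit` type carrying only a label and a mean pitch, and uses simple pitch-difference target and join costs purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A recorded speech unit (hypothetical, reduced to a label and mean pitch)."""
    label: str
    pitch: float  # mean pitch of the recording in Hz


def target_cost(unit: Unit, desired_pitch: float) -> float:
    # How far the unit is from the desired prosodic target.
    return abs(unit.pitch - desired_pitch)


def join_cost(prev: Unit, nxt: Unit) -> float:
    # Penalise pitch jumps where two recordings are concatenated.
    return abs(prev.pitch - nxt.pitch)


def select_units(candidates: list[list[Unit]], targets: list[float]) -> list[Unit]:
    """Pick one candidate per position, minimising total target + join cost."""
    # best[i][j] = (lowest cost of a path ending at candidate j of position i, backpointer)
    best = [[(target_cost(u, targets[0]), -1) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, targets[i]), back))
        best.append(row)
    # Trace back from the cheapest candidate at the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]
```

With low-pitched targets, the search prefers candidates that both match the targets and join smoothly to their neighbours. A production system would use richer costs (spectral distance at the join, duration, phonetic context).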

It should be appreciated, however, that concatenative or synthesised TTS systems can be
used instead of a hybrid TTS system. In the preferred embodiments, the activation of
each component of the hybrid TTS system is optimised to give the best voice quality
possible for each text message conversion.

Concatenative TTS system 

In the preferred embodiments, a concatenative TTS system may alternatively be used to
perform conversion of a text-based message to an audio format message instead of a
hybrid TTS system. In this process the text message is decoded into unique indexes into a
database, herein called a "supported word-base", for each unique word or phrase
contained within the message. The character TTS system then preferably uses these
indices to extract audio format samples for each unique word or phrase from the
supported word-base and concatenates (joins) these samples together into a single audio
format message which represents the complete spoken message, whereby said audio
format samples have been pre-recorded in the selected character's voice or an
impersonation of the selected character's voice.
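
The lookup-and-join step can be sketched in a few lines, assuming each supported word maps to a list of pre-recorded samples. The `WORD_BASE` contents and the tokenising rule are hypothetical; a real word-base would hold audio files (for example WAV) and phrase entries as well as words. A word with no recording raises an error here, which is the case handled by the text verification system described below.

```python
import re

# Hypothetical supported word-base: word -> samples pre-recorded in the
# selected character's voice (short float lists stand in for real audio).
WORD_BASE: dict[str, list[float]] = {
    "hello": [0.1, 0.3, -0.2],
    "my": [0.05, -0.05],
    "friend": [0.2, -0.1, 0.0, 0.1],
}


def tokenize(text: str) -> list[str]:
    """Lower-case the message and keep only its words."""
    return re.findall(r"[a-z']+", text.lower())


def concatenate_message(text: str, word_base: dict[str, list[float]]) -> list[float]:
    """Index each word into the word-base and join its samples in order."""
    audio: list[float] = []
    for word in tokenize(text):
        if word not in word_base:
            # Unsupported word: no recording exists in the character's voice.
            raise KeyError(f"unsupported word: {word!r}")
        audio.extend(word_base[word])
    return audio
```

For example, `concatenate_message("Hello, my friend!", WORD_BASE)` returns the three recordings joined end to end, while `"hello stranger"` fails on `stranger`.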

The character TTS system software may optionally perform processing operations upon
the individual audio format samples or the sequence of audio format samples to increase
the intelligibility and naturalness of the resultant audio format message. Preferably, the
processing may include prosody adjustment algorithms to improve the rate at which the
spoken audio format samples are recorded in the final audio format message and the gaps
between these samples such that the complete audio format message sounds as natural as
possible. Other optional processing steps include intonation algorithms which analyse the
grammatical structure of the text message and continuously vary the pitch of the spoken
message and, optionally, the prosody, to closely match natural speech.
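
One simple way to adjust the gaps between concatenated samples is to insert silence whose length depends on the kind of boundary and on a speaking-rate factor. The sketch below assumes a hypothetical `PAUSES` table measured in samples (tiny numbers for readability) and leaves the recordings themselves untouched; real prosody adjustment would typically also time-stretch or pitch-shift the samples.

```python
# Hypothetical pause lengths (in samples) after each kind of boundary.
PAUSES = {"word": 2, "comma": 4, "sentence": 8}


def join_with_pauses(segments: list[tuple[list[float], str]], rate: float = 1.0) -> list[float]:
    """Concatenate word recordings with silent gaps between them.

    Each segment is (samples, boundary), where boundary names the kind of
    break that follows the word. Gap lengths shrink as the rate increases.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    out: list[float] = []
    for i, (samples, boundary) in enumerate(segments):
        out.extend(samples)
        if i < len(segments) - 1:  # no trailing silence after the last word
            out.extend([0.0] * round(PAUSES[boundary] / rate))
    return out
```

Doubling `rate` halves every gap, which is one crude way of speeding up delivery without altering the recordings.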

Synthesised TTS system 

Whilst a hybrid TTS system is desirable, a synthesised TTS system can also be used.

A synthesised TTS system uses advanced text, phonetic and grammatical processing to
enhance the range of phrases and sentences understood by the TTS system. It relies to a
lesser extent than the concatenative TTS system on pre-recorded words and phrases, and
instead synthesises the audio output based on a stored theoretical model of the selected
character's voice and individual phoneme or diphone recordings.

Shown in Figure 1 is a system used for generating audio messages. The system generally
includes a communications network 4, which may be either the Internet or a PSTN for
example, to which is linked a computing processing means 6 used by a message sender, a
computing processing means 8 used by a recipient, and a server means 10
that may have its own storage means 12 or be associated with a further database 14.
Generally, when a user wishes to send a message that may include background effects or
be in a voice of a well known character, they would type in their message on computing
processing means 6, which is then transmitted to server means 10. The server means 10
may have a text to speech conversion unit incorporated therein to convert the text into
speech and to substitute a portion of or all of the message with speech elements that are
recorded in the voice of a chosen well known character. These recordings are stored in
either database 14 or storage means 12 together with background effects for insertion into
the message. Thereafter the audio message is transmitted to the recipient, either by email
over communications network 4 to the terminal 8 or alternatively as an audio message to
telephone terminal 16. Alternatively the audio message may be transmitted over a mobile
network 18 to a recipient mobile telephone 20, mobile computing processing means 22
or personal digital assistant 24, on which it may then be played back as an audio file. The
network 18 is linked to the communications network 4 through a gateway (e.g. SMS,
WAP) 19. Alternatively the sender of the message or greeting may use telephone
terminal 26 to deliver their message to the server means 10, which has a speech
recognition engine for converting the audio message into a text message which is then
converted back into an audio message in the voice of a famous character with or without
background effects and with or without prosody. It is then sent to either terminal 8 or 16
or one of the mobile terminals 20, 22 or 24 for the recipient. Alternatively the sender of
the message may construct a message using SMS on their mobile phone 28, personal
digital assistant 30 or computing processing terminal 32, which are linked to the mobile
network 18. Alternatively an audio message may be constructed using a mobile terminal
28 and the entire message sent to the server means 10 for further processing as outlined
above.

Basic text verification system (TVS) description

A feature of certain embodiments is the ability to verify that the words or phrases within
the text message are capable of conversion to audio voice form within the character TTS
system. This is particularly important for embodiments which use a concatenative TTS
system, as concatenative TTS systems may generally only convert text to audio format
messages for the subset of words that coincide with the database of audio recorded
spoken words. That is, a concatenative TTS system has a limited vocabulary.

Preferred embodiments have a Text Verification System (TVS) which processes the
text message when it is complete or "on the fly" (word by word). The TVS
checks each word or phrase in the text message for audio recordings of suitable speech
units. If there is a matching speech unit, the word is referred to as a supported word;
otherwise it is referred to as an unsupported word. The TVS preferably substitutes each
unsupported word or phrase with a supported word of similar meaning.

This can be performed automatically so that almost any text message is converted into an
audio format message in which all of the words spoken in the audio format message have
the same grammatical meaning as the words in the text message.

Digital thesaurus based text verification system (TVS) 

Another feature relates to the mechanism used in the optional Text Verification System
(TVS). In preferred embodiments, this function is performed by a thesaurus-based TVS;
however, it should be noted that other forms of TVS (for example, dictionary-based,
supported word-base based, grammatical-processing based) can also be used.

Thesaurus-based TVS preferably uses one or more large digital thesauruses, which
include indexing and searching features. The thesaurus-based TVS preferably creates an
index into the word-base of a selected digital thesaurus for each unsupported word in the
text message. The TVS then preferably indexes the thesaurus to find the unsupported
word. The TVS then creates an internal list of equivalent words based on the synonymous
words referenced by the thesaurus entry for the unsupported word. The TVS then
preferably utilises software adapted to work with or included in the character TTS system.
The software is used to check if any of the words in the internal list are supported words.
If one or more words in the internal list are supported words, the TVS then preferably
converts the unsupported word in the text message to one of said supported words or,
alternatively, displays all of the supported words contained in the internal list to the user
for selection by the user.

If none of the words in the internal list are supported words, the TVS then uses each word
in the internal list as an index back into said digital thesaurus and repeats the search,
preferably producing a second, larger internal list of words with similar meaning to each
of the words in the original internal list. In this way, the TVS continues to expand its
search for supported words until either a supported word is found or a predetermined
search depth is exceeded. If the predetermined search depth is exceeded, the TVS
preferably reports to the user that no equivalent word could be found, and the user can be
prompted to enter a new word in place of the unsupported word.
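
The expanding search described above behaves like a breadth-first search over the thesaurus with a depth limit. The sketch below assumes the thesaurus is a plain dictionary from a word to its synonyms and that the supported word-base is a set of words; both, and the default depth of 2, are hypothetical. It returns the first supported word found rather than a list for the user to choose from.

```python
from collections import deque
from typing import Optional


def find_supported_synonym(
    word: str,
    thesaurus: dict[str, list[str]],
    supported: set[str],
    max_depth: int = 2,
) -> Optional[str]:
    """Breadth-first search of the thesaurus for a supported equivalent word.

    Depth 1 covers the word's own synonyms, depth 2 the synonyms of those,
    and so on. Returns None when the search depth is exhausted, in which
    case the user would be prompted for a new word.
    """
    seen = {word}
    frontier = deque([(word, 0)])
    while frontier:
        current, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the search depth
        for synonym in thesaurus.get(current, []):
            if synonym in seen:
                continue
            if synonym in supported:
                return synonym
            seen.add(synonym)
            frontier.append((synonym, depth + 1))
    return None
```

Breadth-first order means closer synonyms are always tried before more distant ones, which keeps the substituted word as near in meaning as the thesaurus allows.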

It should be noted that correct spelling of each word in the text message, prior to
processing by the TVS, is important, and a spelling checker can be
included as part of the software or preferably as part of the TVS.

Optionally, the TVS may provide visual feedback to the user which highlights, such as by
way of colour coding or other highlighting means, the unsupported words in the text
message. Supported word options can be displayed to the user for each unsupported word,
preferably by way of a drop-down list of supported words, optionally highlighting the
supported word that the TVS determines to be the best fit for the unsupported word the user
intends to replace.

The user can then select a supported word from each of said drop-down lists, thereafter
instructing the software to complete the audio conversion process using the user's
selections for each unsupported word in the original text message.

It should be noted that improved results for the TVS and character TTS system can be
obtained by providing some grammatical processing of sentences and phrases contained
in the text message, by extending the digital thesaurus to include common phrases
and word groups (for example, "will go", "to do", "to be"), and by extending said
supported word-base to include such phrases and word groups, herein called supported phrases.

In this case, the TVS and character TTS system would first attempt to find supported or
synonymous phrases before performing searches at the word level. That is, supported
words, and their use within the context of a supported word-base, can be extended to
include phrases.
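
Searching for phrases before single words can be sketched as a greedy longest-match pass over the message. This is one simple strategy, assumed here for illustration; supported phrases are represented as tuples of words, and the maximum phrase length of 3 is a hypothetical parameter. Single words left over that are not supported would then go through the word-level TVS search.

```python
def segment_phrases(
    words: list[str],
    supported: set[tuple[str, ...]],
    max_len: int = 3,
) -> list[tuple[str, ...]]:
    """Greedily match the longest supported phrase at each position.

    Falls back to a single word when no longer supported phrase starts
    there; such single words may still be unsupported.
    """
    result: list[tuple[str, ...]] = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            chunk = tuple(words[i:i + n])
            if n == 1 or chunk in supported:
                result.append(chunk)
                i += n
                break
    return result
```

For "i will go to do it", the phrases "will go" and "to do" are matched as units, leaving "it" as a single word to be checked on its own.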

TVS enhancements

A further feature provides for multiple thesauruses within the TVS. The thesauruses are
independently configured to bias searches towards specific words and phrases that
produce one or a plurality of specific effects. In this embodiment, the character TTS
system may optionally be configured such that supported words within the word-base
are deliberately not matched but rather sent to the TVS for matching against equivalent
supported words. An example effect would be "hip-hop", whereby if a user entered the
text message "Hello my friend. How are you?", the hip-hop effect method of
the TVS would convert the text message to "Hey dude. How's it hanging man?";
thereafter, the character TTS system would convert said second text message to a spoken
equivalent audio format message.
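
A minimal sketch of an effect thesaurus, assuming the effect is a simple case-insensitive mapping from phrases and words to replacements, with longer phrases tried first. The `HIP_HOP` entries are hypothetical and only cover the example above; the rewritten text would then be passed to the character TTS system as usual.

```python
import re

# Hypothetical "hip-hop" effect thesaurus covering only the example above.
HIP_HOP = {
    "how are you": "how's it hanging man",
    "my friend": "dude",
    "hello": "hey",
}


def apply_effect(text: str, effect: dict[str, str]) -> str:
    """Rewrite text with an effect thesaurus, trying longer phrases first."""
    keys = sorted(effect, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, keys)) + r")\b", re.IGNORECASE
    )
    rewritten = pattern.sub(lambda m: effect[m.group(0).lower()], text)
    # Restore a capital letter at the start of each sentence.
    return re.sub(
        r"(^|[.?!]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        rewritten,
    )
```

Selecting a different dictionary (for example a "Rap" or "Net Talk" table) changes the effect without touching the rest of the pipeline.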

Additional effects can be achieved using the thesaurus-based TVS by adding different
selectable thesauruses, whereby each thesaurus contains words and phrases specific to a
particular desired effect (for example, Rap, Net Talk, etc.).

Preferred language

The language in which the text message is entered and the language of the spoken voices
are a variation of standard English, such as Americanised English. Of course, any other
language can be used.

Language conversion

A language conversion system (LCS) can be used with certain embodiments to convert a
text message in one language to a text message in another language. The character TTS
system is consequently adapted to include a supported word-base of voice samples in one
or more characters, speaking in the target language.

Thus a user can convert a message from one language into another language, wherein the
message is subsequently converted to an audio format message, representative of the
voice of a character or personality, such as one well known in the culture of the second
target language.

Furthermore, the Speech Recognition (SR) system described elsewhere in this
specification can be used in conjunction with this feature to provide a front end for the
user that allows construction of the text message in the first language by recording and
decoding the user's speech, the resulting text message then being processed by the LCS,
character TTS system and optionally the TVS as described above. This allows a user to
speak a message in his own voice and have said message converted to an equivalent
message in another language, whereby the foreign language message is spoken by a well
known character or personality (for example, in the case of French, the French actor
Gerard Depardieu). Of course, this foreign language ability can be utilised with email or
other messaging systems to send and receive foreign language emails in the context of the
described system.

Shown in Figure 2 is an example of the steps that are taken in such language conversion.
Specifically, when a user wishes to construct a message at step 40, they can either type in
the text of the message in their native language at step 42, which is then forwarded to a
language conversion program which may reside on the server means 10, whereby that
program would convert the language of the inputted text into a second language, which
typically would be the native language of the recipient, at step 44. Alternatively the
message sender may use a terminal 26 to dial up the server 10, whereby they input a
message orally which is recognised by a speech recognition unit 46 and reduced to a text
version at step 48, whereby it is then converted into the language of the recipient at step
44. Both streams then feed into step 50, whereby the text in the second language of the
recipient is converted to speech which may include background sound effects or be in the
voice of a well known character, typically native to the country or language spoken by the
recipient, and may then optionally go through the TVS unit at step 52 and be received by
the recipient at step 54.

Non-human and user constructed languages

It should further be noted that some characters may not have a recognisable human
language equivalent (for example, Pokemon monsters). The thesaurus-based TVS and the
character TTS system of the preferred embodiments can optionally be configured such
that the text message can be processed to produce audio sounds in the possibly
constructed language of the subject character.

Furthermore, another feature involves providing a user-customizable supported word-base
within the character TTS system, the customizable supported word-base having means of
allowing the user to define which words in the customizable supported word-base are to
be supported words and, additionally, means of allowing the user to add to the
supported word-base audio format speech samples to provide suitable recorded speech
units for each supported word in said supported word-base. Said audio format speech
samples can equally be recordings of the user's own voice, or audio format samples
extracted from other sources (for example, recordings of a television series).

This allows a user, or an agent on behalf of a plurality of users, to choose or design their
own characters with a non-human or semi-human language, or to design and record the
audio sound of the entirety of the character's spoken language and to identify key
human-language words, phrases and sentences that a user will use in a text message, to
trigger the character to speak the correct sequence of its own language statements.

By way of example, consider the popular Pokemon character Pikachu, which speaks a
language made up of different intonations of segments of its own name. A user or an
agent (for example, a Pokemon writer) could configure an embodiment having a supported
word-base and corresponding audio format speech samples as follows:

Hello "Peeekah",

I "Ppppeeee",

Will "KahKah",

Jump "PeeeChuuuChuuu".

When the user enters the text message "Hello, I will jump", the character TTS system
causes the following audio format message to be produced: "Peeekah Ppppeeee KahKah
PeeeChuuuChuuu". Furthermore, the TVS effectively provides a wider range of text
messages that an embodiment can convert to audio format messages than would a system
without a TVS. For example, if a user were to enter the text message
"Welcome, I want to leap", the TVS would convert said text message to "Hello, I will to
jump". Thereafter, the user could delete the unsupported word "to", consequently
resulting in the generation of the same audio format message as previously described.
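
The Pikachu example can be reproduced with two small tables: the supported word-base given above and a thesaurus used by the TVS. The thesaurus entries (`welcome`, `want`, `leap`) are hypothetical and chosen to match the example, and strings stand in for the audio samples.

```python
import re

# Supported word-base from the example above (strings stand in for audio).
PIKACHU_WORD_BASE = {
    "hello": "Peeekah",
    "i": "Ppppeeee",
    "will": "KahKah",
    "jump": "PeeeChuuuChuuu",
}

# Hypothetical thesaurus entries used by the TVS for this example.
THESAURUS = {"welcome": "hello", "want": "will", "leap": "jump"}


def tvs_rewrite(text: str) -> tuple[list[str], list[str]]:
    """Replace unsupported words with supported equivalents where known.

    Returns the rewritten words and the words that are still unsupported.
    """
    words = re.findall(r"[a-z]+", text.lower())
    rewritten = [
        w if w in PIKACHU_WORD_BASE else THESAURUS.get(w, w) for w in words
    ]
    unsupported = [w for w in rewritten if w not in PIKACHU_WORD_BASE]
    return rewritten, unsupported


def speak(words: list[str]) -> str:
    """Map each supported word to the character's own-language sound."""
    return " ".join(PIKACHU_WORD_BASE[w] for w in words)
```

Deleting the leftover unsupported word `to` before calling `speak` gives the same output as the first example.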

Radical prosody conversion

When a text message is converted to a voice message via the TTS system, the prosody
(pitch and speaking speed) of the message is determined as previously described.
Radical prosody conversion allows the prosody of the
message to be variable, depending upon factors such as:

• the experience level of the user

• native accent of the user

• the need for speedy response

• how busy the network is (faster response = higher throughput)

This feature is particularly appropriate for users of telephony voice menu systems, such as
interactive voice response (IVR) systems, and other repeat-use applications
such as banking, credit card payment systems, stock quotes, movie info lines, weather
reports, etc. The experience level of the user can be determined by one of, or a combination
of, the following or other similar means:

• Selection of a menu item early in the transaction

• The speed or number of "barge-in" requests by the user

• Remembering the user's identification

Consider an example in which a user rings an automated bill payment phone number and
follows the voice prompts, which are given in a famous character's voice. The user presses
the keys faster than average in response to the voice prompts, so the system responds
by speeding up the voice prompts to allow the user to get through the task more quickly.
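
The bill-payment example implies a rule for turning the user's key-press speed into a prompt speed, but the text does not give one. The sketch below assumes a hypothetical linear rule (an average response time divided by the user's mean response time), clamped to a range; `average`, `min_rate` and `max_rate` are illustrative values only.

```python
def prompt_rate(
    response_times: list[float],
    average: float = 2.0,
    min_rate: float = 1.0,
    max_rate: float = 1.5,
) -> float:
    """Speaking-rate factor for the next voice prompt (1.0 = normal speed).

    Users who respond faster than the average response time (in seconds)
    get proportionally faster prompts, clamped to [min_rate, max_rate].
    """
    if not response_times:
        return min_rate  # no evidence about the user yet
    mean = sum(response_times) / len(response_times)
    if mean <= 0:
        return max_rate
    return max(min_rate, min(max_rate, average / mean))
```

Clamping at `min_rate = 1.0` means slow users are never given slower-than-normal prompts under this rule, and the upper bound keeps fast prompts intelligible.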

Alternative prosody generation methods

Typically, prosody in TTS systems is calculated by analysing the text and applying
linguistic rules to determine the proper intonation and speed of the voice output. One
method has been described above which provides a better approximation for the correct
prosodic model. The method previously described is suitable for applications requiring
speech to speech. There are limitations in this method, however. For applications where
the prosodic model is very important but the user can carefully construct a fixed text
message for synthesis, such as in web site navigation or audio banner advertising, another
method of prosody generation (called prosody training) can be provided, whereby the
prosodic model is determined by analysing an audio waveform of the user's own voice as
he/she reads the entered text, with all of the inflection, speed and emotional cues built into
the recording of the user's own voice. However, in this situation, rather than using the
voice recognition engine to generate the text for input into the TTS system, the text
output from the voice recognition engine is discarded and the entered text is used instead.
This reduces the error rate apparent in the text to be streamed to the TTS system.

An additional method of producing better prosodic models for use in TTS systems is
similar to the prosody training method described above but is suitable for use in STS
(speech-to-speech) systems. In an STS system, the user's voice input is required to
generate the text for conversion by the TTS system to a character's voice. The recorded
audio file of the user's input speech can thus be analysed for its prosodic model, which is
subsequently used to train the TTS system's prosodic response as described above.
Effectively, this method allows the STS system to mimic the user's original intonation and
speaking speed.

Yet another method of producing better prosodic models for use in TTS systems involves
marking up the input text with emotional cues to the TTS system. One such markup
language is SABLE, which looks similar to HTML. Regions of the text to be converted to
speech that require specific emphasis or emotion are marked with tags that
instruct the TTS system to modify the prosodic model from what would otherwise be
produced. For example, a TTS system would probably generate the word 'going' with
rising pitch in the text message "So where do you think you're going?". A markup
language can be used to instruct the TTS system to generate the word 'you're' with a
sarcastic emphasis and the word 'going' with an elongated duration and falling pitch. This
markup would modify the prosody generation phase of the TTS or STS system. Whilst
this method of prosody generation is prior art, one novel extension is to include emotion
markups in the actual corpus (the corpus is the textual script of all of the recordings that
make up the recorded speech database) together with many different emotional speech
recordings, so that the recorded speech database has a large variation in prosody and the
TTS system can use the markups in the corpus to enhance the unit selection algorithm.

Markup language

Markup languages can include tags that allow certain text expressions to be spoken by
particular characters. Emotions can also be expressed within the marked-up text that is
input to the character voice TTS system. Some example emotions include:

• Shouting

• Angry

• Sad

• Relaxed

• Cynical

Text to speech markup functions

In addition to the methods described above for marking up text to indicate how the text
message should be converted to an audio file, a toolbar function, menu or right mouse
click sequence can be provided for inclusion in one or more standard desktop applications
where text or voice processing is available. This toolbar, menu or right-click sequence
would allow the user to easily mark sections of the text to indicate the character that will
speak the text, the emotions to be used and other annotations, for example, background
effects, embedded expressions, etc.

For example, the user could highlight a section of text, press the toolbar character
button and select a character from the drop-down list. This would add to the text the
(hidden) escape codes suitable for causing the character TTS system to speak those words
in the voice of the selected character. Likewise, text could be highlighted and the toolbar
button pressed to adjust the speed of the spoken text, the accent, the emotion, the volume,
etc. Visual coding (for example, by colour or via charts or graphs) indicates to the user
where the speech markers are set and what they mean.

Message enhancement techniques

A further aspect relates to the method of encoding a text message with additional
information to allow the character TTS system to embellish the audio format message
thus produced with extra characteristics as described previously. Such embellishments
include, but are not limited to: voice effects (for example, "underwater"), embedded
expressions (for example, "Hubba Hubba"), embedded song extracts and switching
characters (for example, as described in the story telling aspect). The method involves
embedding, within the text message, escape sequences of predefined characters to allow
the character TTS system, when reading said text message, to read sequences of letters
contained between said escape sequences as special codes, which are consequently
interpreted independently of the character TTS system's normal conversion process.

The embedding of canned expressions in the audio stream of speech produced from a
TTS system is described above. Embedded expressions may be either inserted (for
example, clapping, "doh", etc.) or they may be mix-inserted, where they become part of the
background noise, beginning at a certain point and proceeding for a certain period of time
(for example, laughter whilst speaking, background song extracts, etc.) or for the
complete duration of the message.
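
The difference between inserting and mix-inserting an expression can be shown on raw sample lists. This sketch assumes both the speech and the clip are lists of floats at the same sample rate, and uses a hypothetical fixed `gain` for the background; it does no clipping or level normalisation, which a real audio mixer would need.

```python
def insert_expression(speech: list[float], clip: list[float], at: int) -> list[float]:
    """Insert a canned expression: the speech pauses while the clip plays."""
    return speech[:at] + clip + speech[at:]


def mix_expression(
    speech: list[float], clip: list[float], at: int, gain: float = 0.5
) -> list[float]:
    """Mix-insert a clip as background from sample `at` without shifting the
    speech; the output is extended if the clip runs past the end."""
    out = speech + [0.0] * max(0, at + len(clip) - len(speech))
    for i, s in enumerate(clip):
        out[at + i] += gain * s
    return out
```

Inserting lengthens the message by the clip's duration, while mix-inserting keeps the speech timing intact; mixing from sample 0 with a clip as long as the message gives the "complete duration" case.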

Shown in Figure 3 is a system that can be used to allow a telephone subscriber to create a message for another user. The message may be in the subscriber's own voice or in the voice of a well-known character, and may include an introduction and ending together with any background sound effects. Specifically, the sender may use either a mobile telephone 200 or a PSTN phone 202, both of which are linked to a communications network, which may be the PSTN 204. The mobile telephone 200 is linked to the PSTN 204 through a cellular network 206 and an appropriate gateway 207 (either SMS or WAP) via radio link 208. Thus either a voice message or a text message may be transmitted. The PSTN 204 has various signalling controlled through an intelligent network 210. Forming part of the PSTN is a message management centre 212 for receiving messages and a server means 214 that arranges the construction of the message together with background effects and/or in a modified form, such as the voice of a famous person.
Either or both of the MMC 212 and the server means 214 may be a message processing means. The server means 214 receives a request from the message management centre 212 which details the voice and any other effects the message is to have prior to construction of the message. The message management centre (MMC) 212 uses an input correction database 209 to correct any parts of the audio message or text message received, and a phrase matching database 211 to correct any phrases in the message. The MMC 212 has a text to speech conversion unit for converting any SMS message or text message from the user into an audio message before it is passed on to the server means 214. Once the request is received by the server means 214, it constructs the message using background effects from audio files stored in sound effects database 215 and the character voice, with correct prosody, in the type of message requested, using character voice database 213. An audio mixer 221 may also be used. Thus when a user 200 wishes to send a message to another user, who may be using a further mobile telephone 216 or a fixed PSTN phone, the sender will contact the service provider at the message management centre 212 and, after verifying their user ID and password details, will be guided through a step by step process in order to record a message and to add any special effects to that message. The user will be provided with options, generally through an IVR system, in respect of the following subjects:



• To give an impression to the recipient of an environment where the sender is, for example at the beach, at a battleground, at a sporting venue, etc. Recordings of these specific sequences are stored in a data store 218 of the server means 214 or in database 215, and once the desired option is selected this is recorded by the message centre 212 and forwarded on to the server means 214 over link 219 together with the following responses:

• Deciding on a famous voice, from a selection of well-known characters, in which their own voice is to be delivered. The choice is made by the user by depressing a specific button sequence on the phone, and this is also recorded by the message centre 212 and later forwarded on to the server 214;

• Any introduction or ending that a user particularly wants to incorporate into their message, whether or not it is spoken in a character voice, may be chosen. Thus specific speech sequences may be chosen for use as a beginning or end in a character voice, or constructed by the users themselves by leaving a message which is then converted later into the voice of their chosen character.



Once all of this information is recorded by the message management centre 212, it is forwarded to the server 214, which extracts the recorded message and converts it into the voice of the character selected from database 213 using the speech to speech system of the present invention, incorporates the chosen background effect from database 215, which is superimposed on the message, and adds any introduction and ending required. As a combined message this is then delivered to the MMC 212 and to the eventual recipient, by the user selecting a recipient's number stored in their phone or by inputting the destination phone number in response to the IVR. Alternatively, the recipient's number is input at the start. The message may be reviewed and amended prior to delivery. The message is then delivered through the network 204 and/or 206 to the recipient's phone to be heard, or otherwise left as a message on an answering service.
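
The final assembly performed by the server means 214 can be sketched as concatenation plus mixing. The ordering below, in which the background effect runs under the introduction, the character-voice body and the ending alike, is an assumption, since the text does not say whether the introduction and ending also carry the background; audio is again modelled as lists of floats in [-1, 1], and the names are hypothetical.

```python
from typing import List, Optional

Samples = List[float]


def assemble_message(body: Samples, background: Samples,
                     intro: Optional[Samples] = None,
                     ending: Optional[Samples] = None,
                     bg_gain: float = 0.3) -> Samples:
    """Join intro + body + ending, then superimpose a looped background effect."""
    combined = (intro or []) + body + (ending or [])
    if not background:
        return combined
    mixed: Samples = []
    for i, sample in enumerate(combined):
        value = sample + bg_gain * background[i % len(background)]
        mixed.append(max(-1.0, min(1.0, value)))  # keep within [-1, 1]
    return mixed
```

Applying the background only to the body before concatenation would be an equally valid reading of the description.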

An alternative to using a character voice is to not use a voice at all and just provide a greeting such as "Happy Birthday" or "Happy Anniversary", which would be pre-recorded, stored in the data storage means 218 or database 213, and selected by the user through the previously mentioned IVR techniques. Alternatively, a song may be chosen from a favourite radio station which has a list of top 20 songs that are recorded and stored in the database 213 and selected through various prompts by a user. The server 214 would then add any message that might be in a character's voice plus the selected song, and deliver the result to the recipient.

With reference to Figure 4, there are shown various examples of text entry on a sender's mobile terminal 200. Screen 230 shows a message required to be sent to "John" and "Mary" in Elvis Presley's voice that says hello but is sad. Screen 232 shows a message to be sent in Elvis's voice that is happy and is a birthday greeting. Screen 234 shows a message constructed by a service provider in the voice of Elvis that basically says hello and is "cool".

Shown in Figure 5 is a flow diagram showing the majority of processes involved with the present invention. At step 250 a telephone subscriber desires to create a new message or otherwise contact the service provider at step 252, and then at step 254 the subscriber verifies their user ID and password details. At step 256 the subscriber is asked whether they need to make administrative changes or prepare a message. If administrative changes or operations are required, the process moves to step 258, where a user can register or ask questions, create nicknames for a user group, create receiver groups, manage billing, etc. At step 260 the user is prompted to either send the message or not, and if a message is to be sent the process moves to step 262, which also follows on from step 256. At step 262 one of two courses can be followed, one being a "static" path and the other an "interactive" path. A static path is generally where a user selects an option that needs to be sent but does not get the opportunity to review the action, whereas an interactive process is, for example, IVR, where the user can listen to messages and change them. Thus if the static process is requested, the process moves to step 264, where the application and delivery platform are extracted; at step 266 a composed message is decoded, and the destination is decoded at step 268. Thereafter, at step 272 an output message is generated based on the composed message and destination information and delivered to the recipient at step 274, whereby the recipient receives and listens to the message at step 276. The recipient is then given the option to interact with or respond to that message at step 277, which may be done by going back to step 254, where a new message can be created, a reply prepared or the received message forwarded to another user. If no interaction is required, the process is stopped at step 279.

If the interactive path is chosen from step 262, the process moves to step 278, where the selection of an application and delivery platform is performed; the message is composed at step 280, and the user is prompted at step 282 as to whether they wish to review it. If they do not, the process moves to step 284, where the destination or recipient number/address is selected, and then the output message is generated at step 272, delivered at step 274, and received and listened to by the recipient at step 276. If at step 282 the message is requested to be reviewed, then at step 286 the output message is generated for the review platform using the server 214 or MMC 212 and voice database 213; the message is reviewed at step 288 and acknowledged at step 290, or otherwise at step 292 the message is composed again.
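
The two paths of Figure 5 after step 262 can be traced with a small transition table. This Python sketch is illustrative only: the caller's decisions are passed in as callables (in practice they would come from the IVR), the recipient's option at step 277 is not modelled, and the transition from step 290 to step 284 is an assumption, because the text does not state where the acknowledged message goes next.

```python
from typing import Callable, Dict, List, Optional, Union

Next = Union[int, Callable[[], int], None]


def build_transitions(wants_review: Callable[[], bool],
                      approves: Callable[[], bool]) -> Dict[int, Next]:
    """Map each Figure 5 step to its successor (or to a decision function)."""
    return {
        # static path
        264: 266, 266: 268, 268: 272,
        # interactive path
        278: 280, 280: 282,
        282: lambda: 286 if wants_review() else 284,
        284: 272,
        286: 288,
        288: lambda: 290 if approves() else 292,
        290: 284,  # assumption: acknowledged message proceeds to destination selection
        292: 280,  # message is composed again
        # delivery steps shared by both paths
        272: 274, 274: 276, 276: None,
    }


def run_flow(path: str,
             wants_review: Callable[[], bool] = lambda: False,
             approves: Callable[[], bool] = lambda: True,
             max_steps: int = 50) -> List[int]:
    """Return the step numbers visited, starting at step 262."""
    transitions = build_transitions(wants_review, approves)
    step: Optional[int] = 264 if path == "static" else 278
    visited = [262]
    while step is not None and len(visited) < max_steps:
        visited.append(step)
        successor = transitions[step]
        step = successor() if callable(successor) else successor
    return visited
```

For example, `run_flow("static")` visits steps 262, 264, 266, 268, 272, 274 and 276, while a rejected review loops back through step 292 to recompose the message.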

With regard to the input of text on a mobile telephone terminal or PSTN telephone terminal, messages may be easily constructed through the use of templates which are sent to the user from the telecommunications provider. In mobile telecommunications the short message service (SMS) may be used to transmit and receive short text messages of up to 160 characters in length, and templates such as that shown in Figure 6 allow easy input for the construction of voice messages in the SMS environment. In the example shown in Figure 6, the template would appear on the screen of a mobile phone, whereby the 160 character field of the SMS text message is divided into a guard band 300 at the start of the message and a guard band 302 at the end of the message. Between these guard bands there may be a number of fields, in this case seven: the first field 304 is used to provide the subscriber's name, the second field 306 denotes the recipient's telephone number, the third field 308 is the character voice, the fourth field 310 is the type of message to be sent, the fifth field 312 is the style of message, the sixth field 314 indicates any background effects to be used, and the seventh field 316 indicates the time of delivery of the message. In each of the fields 304 to 316, as shown in the expanded portion of the figure, there may be a number of check boxes 318 for use by the sender to indicate the various parts of the type of message they want to construct. All the user has to do is mark an "X" or check the box against whichever of the various options they wish to use in the fields. For example, the sender indicated as Mary in field 304 may want to send a message to receiver David's phone number in the character voice of Elvis Presley, with a birthday message that is happy, a background effect of beach noises, and the message being sent between 11 pm and midnight. As mentioned previously, various instructions may be provided by the telecommunications provider on how to construct this type of message. After it has been constructed, the user need only press the send button on their mobile telephone terminal and the instructed message is received by the MMC 212, translated into voice and sent to the server means 214, which constructs the message using the specified character voice stored in the database 213 and then sends it to the recipient. The server essentially strips out the X-marked or checked options in the constructed message and ignores the other standard or static information that is used in the template.
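
A template of the kind shown in Figure 6 could be decoded on the server side roughly as follows. Figure 6 is not reproduced here, so the concrete syntax is hypothetical: `<<` and `>>` stand in for guard bands 300 and 302, `|` separates the seven fields 304 to 316, and each check box 318 is written as `[X]option` (checked) or `[ ]option` (unchecked). Fields without check boxes, such as the sender's name, are taken as free text.

```python
import re
from typing import Dict, List

GUARD_START, GUARD_END = "<<", ">>"  # hypothetical guard band markers
FIELD_NAMES = ["sender", "recipient", "voice", "type", "style",
               "background", "delivery"]  # fields 304 to 316, in order
_CHECKBOX = re.compile(r"\[(.)\]\s*([^\[]+)")


def parse_template(sms: str) -> Dict[str, List[str]]:
    """Return the checked options (or the free text) of each template field."""
    if len(sms) > 160:
        raise ValueError("an SMS template is limited to 160 characters")
    if not (sms.startswith(GUARD_START) and sms.endswith(GUARD_END)):
        raise ValueError("guard bands missing")
    fields = sms[len(GUARD_START):-len(GUARD_END)].split("|")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    parsed: Dict[str, List[str]] = {}
    for name, raw in zip(FIELD_NAMES, fields):
        boxes = _CHECKBOX.findall(raw)
        if boxes:
            # keep options marked with any non-blank character; ignore the rest
            parsed[name] = [label.strip() for mark, label in boxes if mark.strip()]
        else:
            parsed[name] = [raw.strip()]
    return parsed
```

The values returned are still the template's shorthand (for example `BD`); mapping them to full meanings is a separate step.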

Alternatively, a template may be constructed solely by the subscribers themselves, without having to adhere to a standard format supplied by the telecommunications provider such as that shown in Figure 6.

A set of templates may alternatively be sent from user to user, either as part of a message or when a recipient asks "How did you do that?". Thus instructions may be sent from user to user to show how such a message can be constructed and sent using the templates. Any natural language text typed in as part of the construction of the message, where users use their own templates or devise their own templates, is processed in steps 264 and 266 shown in Figure 5, or alternatively in steps 278 and 280, using the server means 14. Thus an audio message is delivered to the recipient as part of a mapping process, whereby the input text is converted from the template shorthand into such an audio message. The server means 14 can determine the coding for the templates used, including any control elements. As an example, each of the fields 304 to 316 has been devised and set by the server means 214 or MMC 212 to depict a particular part of the message to be constructed, or other characteristics such as the recipient's telephone number and time of delivery. The server means (or alternatively the MMC 212) can determine a dictionary of words that fit within the template structure: for example, for voice, "Elvis" can equal Elvis Presley and "Bill" can equal Bill Clinton, or for the type of message, BD = birthday and LU = love you.
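
The per-field dictionary mentioned above can be as simple as a nested mapping. The entries below are the examples given in the text; keeping a separate dictionary for each field is an assumption that lets the same abbreviation mean different things in different fields. Unknown tokens are passed through unchanged.

```python
from typing import Dict, List

# Example entries from the specification; a real service would hold many more.
SHORTHAND: Dict[str, Dict[str, str]] = {
    "voice": {"Elvis": "Elvis Presley", "Bill": "Bill Clinton"},
    "type": {"BD": "birthday", "LU": "love you"},
}


def expand(field: str, token: str) -> str:
    """Return the full meaning of a template token, or the token itself."""
    token = token.strip()
    return SHORTHAND.get(field, {}).get(token, token)


def expand_fields(fields: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Expand every token of every field of a decoded template."""
    return {name: [expand(name, t) for t in tokens] for name, tokens in fields.items()}
```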

The recipient of a message can edit the SMS message and send it as a response to the sender, or forward it on to a friend or another user. This is converted by the server means to resend a message in whatever format is required, for example an angry message with war sound effects as a background, sent at a different time and in a different character voice.

Alternatively, pre-set messages may be stored on a user's phone, whereby a message may be extracted from the memory of the phone by depressing any one of the keys on the phone and used as part of the construction of the message to be sent to the recipient. Effects can be added to a message during its playback, at various times or at various points within that message, by depressing a key on the telephone. For example, at the end of each sentence of a message a particular background effect or sound may be added.

As an example of the abovementioned concepts using SMS messages, somebody at a football match can send a message via SMS text on their mobile phone in the stadium. They can simply enter the words "team, boo" and the receiver's phone number. After the message is processed, the receiver gets a voice message in a famous player's voice, with background sound effects, saying "a pity your team is losing by 20 points, there is no way your team is going to win now". The receiver can immediately turn this around and send a reply by depressing one or two buttons on their telephone and constructing an appropriate response. Alternatively they can edit the received message or construct a new message as discussed above.

The above concepts are equally applicable to use over the Internet (communications network 204), whereby each of the mobile devices 200, or equivalently PDAs or mobile computing terminals, that are WAP enabled can have messages entered and sent to the server means 214, where they are constructed or converted into an audio message intended for a particular recipient.



A particular message constructed by a subscriber may be broadcast to a number of recipients, whereby the subscriber has entered the respective telephone numbers of a particular group in accordance with step 258 of Figure 5. This may be done either through a telecommunications network or through the Internet via websites. A particular tag or identifier is used to identify the group to which the message, such as a joke, is to be broadcast. The MMC 212 and the server means 214 receive the message and decode the destination data, which is then used to broadcast the message, via IVR destination selection, to each one of the members of that group. This is in essence a technique that produces a whole number of calls from one single message. Each recipient of the broadcast message can reconstruct it as another message and forward it on to another user or a group of users, or reply to it.
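
Decoding a group tag into individual deliveries can be sketched in a few lines. The `#` prefix marking a group tag and the in-memory group table are hypothetical stand-ins for the receiver groups created at step 258; duplicate numbers are dropped so that a member listed twice receives only one call.

```python
from typing import Dict, List, Tuple


def expand_destinations(destination: str,
                        groups: Dict[str, List[str]]) -> List[str]:
    """Turn a destination (a number or a '#tag') into a list of numbers."""
    if destination.startswith("#"):
        tag = destination[1:]
        if tag not in groups:
            raise KeyError(f"unknown group tag: {tag}")
        return list(dict.fromkeys(groups[tag]))  # de-duplicate, keep order
    return [destination]


def broadcast(message: str, destination: str,
              groups: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Produce one (number, message) delivery per recipient."""
    return [(number, message) for number in expand_destinations(destination, groups)]
```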

Shown in Figure 7 is a series of drop down menus 350 that will typically be transmitted from a server means 214 through the MMC 212 to a respective mobile terminal 200, in order to allow the user of the mobile terminal 200 to construct a message from the expressions 352 included in each of the drop down menus. All the user has to do is highlight or select a particular expression in each window of the drop down menus to construct a sentence or a number of expressions, in order to pass on a message to one or more recipients. This may alternatively be done through the Internet, whereby a computing terminal, or a mobile phone or PDA that is WAP enabled, may be used to construct the same message. It is then forwarded to and processed by the MMC 212, which converts it to an audio message in the manner described above. Each message can include other effects, such as the background sounds or expressions mentioned previously. Scroll bars 354 are used to scroll through the various optional phrases or parts of the sentence/message to be constructed.

Another embodiment of the present invention is a system whereby words or expressions uttered by famous characters are scrutinised and managed to the extent that certain words are not allowed to be uttered by the particular character. In a particular context some characters should not say certain words or phrases. For example, a particular personality may have a sponsorship deal with a brand that precludes the speaking of another brand, or the character or personality may wish to ensure that their voice does not say certain words in particular situations.
Shown in Figure 8 is a flow chart showing the processes involved when a word or phrase is not to be spoken by the selected character. At step 502 a prohibit list is established for the character or personality in a database, which may be database 211 or a storage means 218 of the server means 214. This database 211 would contain a list of words or expressions that are not to be uttered by the selected character. At step 504 the user inputs the words or phrase, and at step 506 selects the character or personality to say that particular word or phrase. At step 508 the server means checks the word or phrase against the character's or personality's prohibit list in that particular database 211. At step 510 a query is made as to whether the word or phrase exists in the prohibit list in the database for the particular character, and if so a prohibit flag is set against that word or phrase as being not OK. This is done at step 512. If the word or phrase does not exist in the prohibit list in the database for that particular character, then a prohibit flag is set against that word or phrase as being OK at step 514. After step 512, a substitute word or phrase is searched for and found in a digital thesaurus, which may form part of database 209, at step 516; it is then used in the text based message (or audio message) and the process goes back to step 508. If the prohibit flag is OK, as in step 514, then the process continues and the word or phrase is used in the message, which is then delivered in step 518.
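
The Figure 8 loop can be sketched word by word. In this Python illustration the prohibit lists and the thesaurus are plain dictionaries standing in for databases 211 and 209; matching is case-insensitive and word-level only, and dropping a word for which no acceptable substitute exists is an assumption, since the flow chart does not cover that case.

```python
from typing import Dict, List, Set


def apply_prohibit_list(words: List[str], character: str,
                        prohibit: Dict[str, Set[str]],
                        thesaurus: Dict[str, List[str]]) -> List[str]:
    """Replace words the character must not say with permitted substitutes."""
    banned = {w.lower() for w in prohibit.get(character, set())}   # step 502
    result: List[str] = []
    for word in words:
        candidate = word
        tried: Set[str] = set()
        while candidate.lower() in banned:            # steps 508-512: flag "not OK"
            tried.add(candidate.lower())
            options = [s for s in thesaurus.get(candidate.lower(), [])
                       if s.lower() not in tried]
            if not options:
                candidate = ""                        # assumption: no substitute found
                break
            candidate = options[0]                    # step 516, then re-check at 508
        if candidate:
            result.append(candidate)                  # steps 514 and 518: flag "OK"
    return result
```

Because every rejected candidate is remembered in `tried`, a thesaurus that maps banned words onto each other cannot make the loop run forever.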

Shown in Figure 9 are process steps used in accordance with a natural language conversion system, whereby a user can enter or select a natural language input option from a drop down menu on their terminal to establish a session between the user and a natural language interface (NLI). This is done at step 550. Then at step 552 the NLI loads an application or user specific prompts/query engine, and at step 554 the NLI prompts for natural language user input by automated voice prompts. The user is thus directed to ask questions or make a comment at step 556. After that, at step 558 the NLI processes the natural language input from the user and determines a normalised text outcome. In this way a natural question from a user is converted into predefined responses that are set or stored in a memory location in, for example, the server means 214. At step 560 a query is made as to whether there is sufficient information to proceed with message construction. If the answer is yes, a "proceed" flag is set to "OK" at step 561, and at step 562 conversion of the user input using the normalised text proceeds to create the message. If there is not enough information to proceed with the message construction, the "proceed" flag is set to "not OK" at step 563 and the process goes back to step 554 for further prompts for natural language user input. The above system or interface is implemented through a telecommunications system or other free form interactive text based system, for example email, chat, speech text or Internet voice systems.
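
The prompt, normalise and check cycle of Figure 9 can be sketched as a loop around a proceed flag. Everything concrete here is hypothetical: the two required slots, the keyword vocabulary used as a crude stand-in for real natural language processing, and the `ask` callable that would, in practice, play a voice prompt and return recognised text.

```python
from typing import Callable, Dict, Optional

REQUIRED = ("voice", "type")  # hypothetical minimum for building a message


def normalise(utterance: str,
              vocabulary: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Step 558 (simplified): map key phrases to normalised slot values."""
    text = utterance.lower()
    found: Dict[str, str] = {}
    for slot, phrases in vocabulary.items():
        for phrase, value in phrases.items():
            if phrase in text:
                found[slot] = value
                break
    return found


def nli_session(ask: Callable[[str], str],
                vocabulary: Dict[str, Dict[str, str]],
                max_turns: int = 5) -> Optional[Dict[str, str]]:
    """Prompt until enough information is gathered, or give up."""
    slots: Dict[str, str] = {}
    for _ in range(max_turns):
        missing = [s for s in REQUIRED if s not in slots]
        reply = ask("Please tell me the " + " and ".join(missing))  # steps 554-556
        slots.update(normalise(reply, vocabulary))                 # step 558
        proceed = all(s in slots for s in REQUIRED)                # step 560
        if proceed:
            return slots  # step 561 ("OK"); step 562 then builds the message
        # step 563: proceed flag "not OK", so return to step 554
    return None
```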

Shown in Figure 10 are process steps used by a user to construct a message using a speech interface (SI). Users will interface via a telephony system or other constrained interactive text based system, which will input their responses to queries and convert such responses into normalised text for further conversion into a message via the techniques already outlined. Thus in step 600 a session is established between the user and the speech interface, which may be part of the server means 214 or MMC 212. At step 602 the speech interface loads the application or user specific prompts/query engine, and at step 604 the speech interface prompts the user for constrained language user input via automated voice prompts. At step 606 the user provides the constrained language user input, and at step 608 the speech interface processes this input and determines normalised text from it.

Examples of constrained language user input include the following question and answer sequence:

Q: Where would you like to travel?
A: Melbourne
or
A: I would like to go to Melbourne on Tuesday.
or
A user says: "I want to create a birthday message in the voice of Elvis Presley".

Based on the information received, the MMC 212 or server 214 determines from stored phrases and words whether a message can be constructed.

At step 610 a decision is made by the MMC 212 or server 214 as to whether enough information has been processed in order to construct a message. If not enough information has been provided, then at step 614 the process reverts (after setting the "proceed" flag to "not OK" at step 613) back to step 604, where the speech interface prompts for further constrained user input. If there is sufficient information at step 610, the process proceeds to step 612 (after setting the "proceed" flag to "OK" at step 611) with the conversion of the user input using normalised text in order to create the message.
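
Because the speech interface of Figure 10 expects constrained answers, the normalisation at step 608 can check replies against closed lists of accepted values. The sketch below reuses the travel example above; the destination list, the slot names and the rule that only a destination is required at step 610 are assumptions made for illustration, and only single-word destinations are recognised.

```python
import re
from typing import Dict, Iterable, Tuple

DAYS = ("monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday")


def parse_constrained(answer: str, destinations: Iterable[str]) -> Dict[str, str]:
    """Step 608 (simplified): pull known values out of a constrained reply."""
    known = {d.lower(): d for d in destinations}
    slots: Dict[str, str] = {}
    for word in re.findall(r"[a-z]+", answer.lower()):
        if word in known and "destination" not in slots:
            slots["destination"] = known[word]
        elif word in DAYS and "day" not in slots:
            slots["day"] = word.capitalize()
    return slots


def enough_information(slots: Dict[str, str],
                       required: Tuple[str, ...] = ("destination",)) -> bool:
    """Step 610: decide whether the proceed flag can be set to "OK"."""
    return all(key in slots for key in required)
```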



Expressions can be added by a What You See Is What You Hear (WYSIWYH) tool, described in a following section, or during regular textual data entry by pressing auxiliary buttons, selecting menu items, using right mouse click menus, etc. The expression information is then placed as mark-ups (for example, SABLE or XML) within the text to be sent to the character voice TTS system.
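
Placing expression information in the text as XML mark-up could look like the following Python sketch, which uses the standard `xml.etree.ElementTree` module so that reserved characters in the user's text are escaped automatically. The `message` and `expression` element names and the `voice` and `name` attributes are hypothetical; they are neither SABLE tags nor a schema defined in the specification.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple, Union

# Plain text, or an (element, name) pair such as ("expression", "laugh").
Item = Union[str, Tuple[str, str]]


def to_markup(items: List[Item], voice: str) -> str:
    """Serialise text and embedded expressions as a single XML string."""
    root = ET.Element("message", {"voice": voice})
    last: Optional[ET.Element] = None  # most recently added child element
    for item in items:
        if isinstance(item, str):
            if last is None:
                root.text = (root.text or "") + item  # text before any marker
            else:
                last.tail = (last.tail or "") + item  # text after a marker
        else:
            tag, name = item
            last = ET.SubElement(root, tag, {"name": name})
    return ET.tostring(root, encoding="unicode")
```

A character TTS front end would then walk the parsed tree, speaking the text and splicing in a recorded clip for each expression element.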

Laughing, clapping and highly expressive statements are examples of embeddable expressions. However, other quality enhancing features can also be added. Background sounds can be mixed in with the audio speech signal to mask any inconsistencies or unnaturalness produced by the TTS system. For example, the output of a TTS system characterised with the voice of Murray Walker (F1 racing commentator) could be mixed with background sounds of screaming Formula One racing cars. A character TTS system for a sports personality (such as, for example, Muhammad Ali) could have sounds of cheering crowds, punching sounds, sounds of cameras flashing, etc. mixed into the background. A character TTS system for Elvis Presley could have music and/or singing mixed into the background.

Background sounds could include, but are not limited to, white noise, music, singing, people talking, normal background noises and sound effects of various kinds.

Another class of technique for improving the listening quality of the produced speech involves deliberately distorting the speech, since the human ear is more sensitive to imperfections in natural voice syntheses than to imperfections in non-natural voice syntheses. Two methods can be provided for distorting speech while maintaining the desirable quality that the speech is recognisable as the target character. The first method involves applying post-process filters to the output audio signal. These post-process filters provide several special effects (for example, underwater, echo, robotic, etc.). The second method is to use the characteristics of the speech signal within a TTS or STS system (for example, the phonetic and prosodic models) to deliberately modify or replace one or more components of the speech waveform. For example, the F0 signal could be frequency shifted from typical male to typical female (i.e., to a higher frequency), resulting in a voice that sounds like, for example, Homer Simpson, but at a more female, higher pitch. Alternatively, the F0 signal could be replaced with an F0 signal recorded from some unusual source (for example, a lawn mower, a washing machine or a dog barking). This would result in a voice that sounds like a cross between Homer Simpson and a washing machine, or a voice that sounds like a pet dog, for example.
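
Both distortion methods can be illustrated on simple data. In this Python sketch, `echo` is a basic post-process filter on samples (the first method), while `shift_f0` and `replace_f0` act on a frame-by-frame F0 contour in hertz (the second method), with 0.0 marking unvoiced frames. The contour representation, the nearest-neighbour resampling and the parameter names are assumptions; turning a modified contour back into audio would need a pitch-modifying synthesiser, which is not shown.

```python
from typing import List, Sequence


def echo(samples: Sequence[float], delay: int, decay: float = 0.5) -> List[float]:
    """Post-process filter: add one delayed, attenuated copy of the signal."""
    if delay < 1:
        raise ValueError("delay must be at least one sample")
    out = list(samples)
    for i in range(delay, len(out)):
        out[i] += decay * samples[i - delay]
    return out


def shift_f0(contour: Sequence[float], ratio: float) -> List[float]:
    """Scale every voiced F0 value; a ratio above 1 raises the pitch."""
    return [f * ratio if f > 0 else 0.0 for f in contour]


def replace_f0(contour: Sequence[float], donor: Sequence[float]) -> List[float]:
    """Swap in the F0 track of another sound, resampled to the same length,
    while keeping the original voiced/unvoiced pattern."""
    if not donor:
        raise ValueError("donor contour is empty")
    n, m = len(contour), len(donor)
    return [donor[min(m - 1, i * m // n)] if f > 0 else 0.0
            for i, f in enumerate(contour)]
```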

Text input, expressions and filters

When interacting with the Web site to construct personalised text messages for conversion to the chosen character's voice, the first or second user enters a Web page dedicated to the chosen character (for example, an Elvis Presley page). Preferably, each character page is similar in general design and contains a message construction section having a multi-line text input dialogue box, a number of expression links or buttons, and a special effects scroll list. The first or second user can type the words of the message to be spoken into the multi-line text input dialogue box and optionally include in this message specific expressions (for example, "Hubba Hubba", "Grrrr", laugh) by selecting the appropriate expression links or buttons.

Pre-recorded audio voice samples of these selected expressions are automatically inserted into the audio format message thus produced by the character TTS system. The text message, or a portion of it, may be marked to be post-processed by the special effects filters in the software, preferably by selecting the region of text and then selecting an item from the special effects scroll list. Example effects include "under water" and "with a cold", which distort the sound of the voice as expected.

It should be noted that while the Web site is used as the preferred user interface, any other suitable user interface (for example, dedicated software on the user's compatible computer, a browser plug-in, a chat client or an email package) can easily be adapted to include the necessary features without detracting from the user's experience.

By way of example, shown in Figure 11 is a web page 58 accessed by a user who wishes to construct a message; the web page may reside on a server such as server means 10 or another server linked to the Internet 4. Once the website is accessed, the user is presented with a dialogue box 60 for the input of text for the construction of the message. Clicking on a further box 62 directs the user to various expressions, as outlined above, that they may wish to insert into the message at various locations. A further box 64 allows special effects, such as "under water" or "with a cold", to be applied to all or a portion of the message by the user selecting and highlighting the particular special effect in which they wish the message to be delivered. The message is then sent to the recipient by the user typing in, for example, the recipient's email address, so that the recipient hears the message, with any expressions or special effects added, in the voice of the character of the particular website that was accessed by the sender.

Unauthorised use of a voice 

A character voice TTS generated audio format file can be protected from multiple or unauthorised use by encryption or with time delay technology. It is desirable to retain control of the use of the characters' voices. Amongst other advantages, this can help ensure that the characters' voices are not used inappropriately and that copyrights are not abused contrary, for example, to any agreement between us and a licensor entity. One method of implementing such control measures may involve encoding audio format voice files in a proprietary code and supplying a decoder/player (as a standalone software module or browser plug-in) for use by a user. This decoder may be programmed to play the message only once and discard it from the user's computer thereafter.
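
A decoder that refuses foreign files and plays a message only once can be sketched with Python's standard `hmac` module. Note the substitution: this sketch uses an HMAC-SHA256 tag in place of the "proprietary code", which authenticates the file but does not encrypt it, so a real deployment would also need encryption; the key handling and the names `package` and `PlayOncePlayer` are hypothetical.

```python
import hashlib
import hmac
from typing import Optional

TAG_LEN = hashlib.sha256().digest_size  # 32-byte authentication tag


def package(audio: bytes, key: bytes) -> bytes:
    """Service side: append an HMAC-SHA256 tag to the generated audio."""
    return audio + hmac.new(key, audio, hashlib.sha256).digest()


class PlayOncePlayer:
    """Player side: verify the tag, then release the audio a single time."""

    def __init__(self, packaged: bytes, key: bytes) -> None:
        if len(packaged) < TAG_LEN:
            raise ValueError("file too short to carry a tag")
        audio, tag = packaged[:-TAG_LEN], packaged[-TAG_LEN:]
        expected = hmac.new(key, audio, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("file was not produced by the licensed service")
        self._audio: Optional[bytes] = audio

    def play(self) -> bytes:
        if self._audio is None:
            raise PermissionError("message has already been played")
        audio, self._audio = self._audio, None  # discard after first playback
        return audio
```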

Speech to speech systems

A logical extension to the use of a TTS system for some of the applications of our invention is to combine the TTS system with a speech recognition engine. The resulting system is called a speech to speech (STS) system. There are two main benefits of providing a speech recognition engine as a front end to the invention:

1. The user can speak input into the system rather than having to type it.

2. The system can analyse the prosody (pitch and speed) of the spoken message in order to provide a better prosodic model for the TTS system than can be obtained purely from analysing the text. This feature is optional.
There are two streams of research in speech recognition systems. These are:

• Speaker independent untrained recognition. The strength of this type of system is that it is good at handling many different users' voices without requiring the system to be trained to understand each voice. Its applications include telephony menus, etc.

• Speaker dependent trained recognition. The strength of this type of system is that the speech recognition system can be trained to better understand one or more specific users' voices. These systems are typically capable of continuous speech recognition from natural speech. They are suitable for dictation type applications and particularly useful for many of the applications of our invention, particularly email and chat.

Speech recognition and text to speech systems can advantageously be used together for voice translation from one character's voice (i.e. the user's) to another character's voice in the same human language.

To obtain a prosodic model from the spoken (i.e. the user's) message for use in an STS system, an additional module needs to be added to the speech recognition system. This module continuously analyses the waveform for the fundamental frequency of the larynx (often called F0), pitch variation (for example, rising or falling) and the duration of the speech units. This information, when combined with the phonetic and text models of the spoken message, can be used to produce a very accurate prosodic model which closely resembles the speed and intonation of the original (user's) spoken message.
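
The analysis module described above needs, at minimum, an F0 estimate per frame and a measure of pitch movement. A common textbook approach, used here in place of any specific method from the specification, is to pick the autocorrelation peak within a plausible pitch range. The 60 to 400 Hz search range, the 0.3 voicing threshold and the 5 % tolerance in `pitch_trend` are illustrative assumptions.

```python
from typing import List, Optional, Sequence


def estimate_f0(frame: Sequence[float], sample_rate: int,
                fmin: float = 60.0, fmax: float = 400.0) -> Optional[float]:
    """Return the F0 of one frame in Hz, or None if it looks unvoiced."""
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    energy = sum(x * x for x in frame)
    if energy == 0.0 or max_lag <= min_lag:
        return None
    best_lag, best_r = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    if best_r < 0.3 * energy:  # weak periodicity: treat as unvoiced
        return None
    return sample_rate / best_lag


def pitch_trend(track: Sequence[Optional[float]], tolerance: float = 0.05) -> str:
    """Classify an F0 track (None = unvoiced) as rising, falling or flat."""
    voiced: List[float] = [f for f in track if f]
    if len(voiced) < 2:
        return "flat"
    change = (voiced[-1] - voiced[0]) / voiced[0]
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "flat"
```

Running `estimate_f0` over successive frames gives the F0 track, and frame timing gives the durations needed for the prosodic model.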

Character-based stories 

The first or second user can select a story for downloading to the first user's computer or toy. The first user may optionally choose to modify the voices that play any or each of the characters and/or the narrator in the story by entering a web page or other user interface component and selecting each character from drop down lists of supported character voices. For example, the story of Snow White could be narrated by Elvis Presley, with Snow White played by Inspector Gadget, the Mirror by Homer Simpson and the Wicked Queen by Darth Vader.
When the software subsequently processes the story and produces the audio format
message for the story, it preferably concatenates the story from segments of recorded
character voices. Each segment may be constructed from sound bites of recorded words,
phrases and sentences, or optionally partially or wholly constructed using the character
TTS system.
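The casting and concatenation steps above can be sketched as follows. This is a minimal illustration, assuming a story script of `(role, line)` pairs, a casting table chosen by the first user, and a library of recorded sound bites keyed by `(voice, text)`. `placeholder_tts` is a hypothetical stand-in for the character TTS system, and joining segments as raw bytes assumes they all share one PCM format.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    role: str
    voice: str
    text: str
    source: str   # "recorded" sound bite or "tts" fallback
    audio: bytes

def placeholder_tts(voice: str, text: str) -> bytes:
    """Hypothetical stand-in for the character TTS system."""
    return f"[{voice}: {text}]".encode("utf-8")

def build_story(script, casting, sound_bites, tts=placeholder_tts):
    """Render each (role, line) of a story in the voice cast for that role.

    A recorded sound bite is preferred; the character TTS system is the fallback.
    """
    segments = []
    for role, text in script:
        voice = casting[role]
        bite = sound_bites.get((voice, text))
        if bite is not None:
            segments.append(Segment(role, voice, text, "recorded", bite))
        else:
            segments.append(Segment(role, voice, text, "tts", tts(voice, text)))
    return segments

def concatenate(segments) -> bytes:
    # Assumes every segment is raw PCM in one common format, so bytes can be joined.
    return b"".join(s.audio for s in segments)

casting = {"narrator": "Elvis Presley", "mirror": "Homer Simpson", "queen": "Darth Vader"}
script = [
    ("narrator", "Once upon a time there lived a wicked queen."),
    ("queen", "Mirror, mirror on the wall, who is the fairest of them all?"),
    ("mirror", "Snow White is the fairest of them all."),
]
sound_bites = {("Homer Simpson", "Snow White is the fairest of them all."): b"\x01\x02\x03\x04"}
story = build_story(script, casting, sound_bites)
print([s.source for s in story])
```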

Message directory 

A database of messages for a specific user's use can be provided. The database contains
information relating to an inventory of the messages sent and received by the user. The
user may thereafter request or otherwise recall any message previously sent or received,
either in original text form or audio format form, for the purposes of re-downloading said
message to a compatible computer or transmitting the message to another user by way of
the Internet email system.


In the case of a toy embodiment, one or more selected audio format messages can be
retransferred by a user. The audio format message may have previously been transferred
to the toy but may have subsequently been erased from the non-volatile memory of the
toy.


The database may be wholly or partially contained within Internet servers or other
networked computers. Alternatively, the database may be stored on each individual user's
compatible computer. Optionally, the voluminous data of each audio format message may
be stored on the user's compatible computer, with just the indexing and relational
information of the database residing on the Internet servers or other networked computers.
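A minimal sketch of the message directory, assuming SQLite (through Python's standard `sqlite3` module) holds the index while each `audio_location` value points at audio that may live on the user's own computer, matching the split storage option above. The table and column names are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY,
    user_id TEXT NOT NULL,
    direction TEXT NOT NULL CHECK (direction IN ('sent', 'received')),
    counterpart TEXT NOT NULL,
    character_voice TEXT,
    text_body TEXT,
    -- where the (possibly large) audio lives, e.g. a path on the user's computer
    audio_location TEXT
);
"""

def open_directory(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def record_message(conn, user_id, direction, counterpart, voice, text, audio_location):
    cur = conn.execute(
        "INSERT INTO messages (user_id, direction, counterpart, character_voice,"
        " text_body, audio_location) VALUES (?, ?, ?, ?, ?, ?)",
        (user_id, direction, counterpart, voice, text, audio_location),
    )
    conn.commit()
    return cur.lastrowid

def inventory(conn, user_id):
    """List a user's sent and received messages, oldest first."""
    return conn.execute(
        "SELECT id, direction, counterpart, character_voice, audio_location"
        " FROM messages WHERE user_id = ? ORDER BY id",
        (user_id,),
    ).fetchall()

def recall(conn, message_id):
    """Fetch one message's text and audio location for re-download or forwarding."""
    return conn.execute(
        "SELECT text_body, audio_location FROM messages WHERE id = ?", (message_id,)
    ).fetchone()
```

Keeping only paths in the shared index means the server never has to hold the audio itself, which is the trade-off the paragraph above describes.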

Jokes and daily messages 

Another feature relates to the first or second user's interaction sequences with the
software via the Web site, the software's consequential communications with the first
user's compatible computer and, in the toy embodiment, subsequent communications with
the first user's toy.
A Web site can be provided with access to a regularly updated database of text or audio
based jokes, wise-cracks, stories, advertisements and song extracts, either recorded in the
supported characters' voices (or impersonations of them) or constructed by processing the
text version of said jokes, wise-cracks and stories via the character TTS system.

The first or second user can interact with the Web site to cause one or more of the
pre-recorded messages to be downloaded and transferred to the first user's computer or, in
toy-based embodiments, subsequently transferred to the first user's toy as described
above.

Optionally, the first or second user, and preferably the first user, can cause the software to
automatically download a new joke, wise-crack, advertisement, song extract and/or story
at regular intervals (for example, each day) to the first user's computer or toy, or to send an
email notification that a new item exists and can later be collected from the Web
site.

It should be noted that the database of items can be extended to other audio productions
as required.

Email and greeting cards 

A second user with a computer and Web browser and/or email software can enter or
retrieve a text message into the software and, optionally, select the character whose voice
will be embodied in the audio format message.

The software performs the conversion to an audio format message and preferably
downloads the audio format message to the first user. Alternatively, the first user is
notified, preferably by email, that an audio format message is present at the Web site for
downloading. The first user completes the downloading and transfer of the audio format
message as described above. This process allows the second user to send an electronic
message to the first user, in which the message is spoken in a specific character's voice.



In the toy embodiment, the audio format message is transferred to the toy via the toy's
connection means, thereby enabling a toy, which for portability can be disconnected from
the compatible computer, to read an email message from a third party in a specific
character's voice.

The audio file of the speech (including any expressions, effects, backgrounds etc.)
produced by the TTS system may be transmitted to a recipient as an attachment to an email
message (for example, in .WAV or .MP3 format) or as a streamed file (for example, in AU
format). Alternatively, the audio file may be kept on the TTS server, with a hypertext
link included in the body of the email message to the recipient. When the recipient clicks
on the hyperlink in the email message, the TTS server is instructed to transmit the
audio format file to the recipient's computer in a streaming or non-streaming format.
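The two delivery options above (audio attached to the email, or a hyperlink to audio kept on the TTS server) can be sketched with Python's standard `email` and `wave` modules. The addresses and the `https://tts.example.com/...` URL are placeholders, and the short silent WAV stands in for real TTS output.

```python
import io
import wave
from email.message import EmailMessage

def make_wav(samples, rate=8000):
    """Pack 16-bit mono PCM samples (integers) into WAV file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"".join(int(s).to_bytes(2, "little", signed=True) for s in samples))
    return buf.getvalue()

def _base_message(sender, recipient, subject):
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    return msg

def attachment_email(sender, recipient, subject, wav_bytes):
    """Option 1: send the character-voice audio as a .wav attachment."""
    msg = _base_message(sender, recipient, subject)
    msg.set_content("You have a talking message. Open the attached file to play it.")
    msg.add_attachment(wav_bytes, maintype="audio", subtype="wav", filename="message.wav")
    return msg

def link_email(sender, recipient, subject, audio_url):
    """Option 2: keep the audio on the TTS server and send only a hyperlink."""
    msg = _base_message(sender, recipient, subject)
    msg.set_content(f"Click the link to hear your message: {audio_url}")
    return msg

wav = make_wav([0] * 800)  # 0.1 s of silence standing in for TTS output
with_attachment = attachment_email("sender@example.com", "friend@example.com",
                                   "A message from Elvis", wav)
with_link = link_email("sender@example.com", "friend@example.com",
                       "A message from Elvis", "https://tts.example.com/audio/12345")
```

The link variant keeps the email small and lets the server decide at click time whether to stream or send the whole file.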

The audio format file may optionally be played automatically on the recipient's computer
during, or immediately following, download. It may also optionally be saved on the
recipient's storage media for later use, or forwarded via another email message to another
recipient. Streaming audio may also be used to deliver the sound file while it plays.

The email message may optionally be broadcast to multiple recipients rather than just sent
to a single recipient. The TTS server may determine, or be otherwise automatically
instructed as to, the content of the recipient list (for example, all registered users whose
birthdays are today), or it may be given a list of recipients by the sender.
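The automatic recipient rule in the example (registered users whose birthday is today) reduces to a month/day comparison. This sketch assumes a simple `RegisteredUser` record, which is hypothetical, lets an explicit list from the sender take precedence, and does not handle 29 February birthdays in non-leap years.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class RegisteredUser:
    name: str
    email: str
    birthday: date

def birthday_recipients(users, today):
    """Email addresses of registered users whose birthday falls on `today`."""
    return [u.email for u in users
            if (u.birthday.month, u.birthday.day) == (today.month, today.day)]

def recipient_list(users, today, sender_list=None):
    # A list supplied by the sender overrides the automatic birthday rule.
    return list(sender_list) if sender_list else birthday_recipients(users, today)
```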

The text for the email message may be typed in or it may be collected from a speech
recognition engine, as described elsewhere in the section on Speech To Speech (STS)
systems.

In addition to sending an audio message via email in a particular character voice, an email
reading program can be provided that can read incoming text email messages and convert
them to a specific character's voice.

Alternatively, the email may be in the form of a greeting card including a greeting
message and a static or animated visual image.

Consider an example of sending an e-mail or on-line greeting card and having the
message spoken in the voice of John Wayne, Bill Clinton, Dolly Parton, Mickey Mouse™
or Max Smart. The sender can enter the text into the e-mail or digital greeting card. When
the recipient receives the e-mail or card and opens it, famous character voices speak to
the recipient as if reading the text that the sender had inserted. There could be
one or more characters speaking on each card, in turn or more than one at a time, and the
speech could be selected to speak normally, shout, sing, or laugh and speak, with
background effects and personal mannerisms included.

Another feature of certain embodiments is a Speech Recognition (SRS) system which
may optionally be added to the email processing system described above. The SRS
system is used by a user to convert his own voice into a text message, the text message
thereafter being converted to a character's voice in an audio format message by the
character TTS system. This allows a user to have a spoken message converted to another
character's voice.

Chat rooms 

Users can be allowed to interact with an Internet chat server and client software (for
example, ICQ or other IRC client software) so that users of these chat rooms and chat
programs, referred to herein as "chatters", can have incoming and/or outgoing text
messages converted to audio format messages in the voice of a specific character or
personality. During chat sessions, chatters communicate in a virtual room on the Internet,
wherein each chatter types or otherwise records a message which is displayed to all
chatters in real time or near real time. By using appropriate software or software
modules, chat software can be enhanced to allow chatters to select from available
characters and have their incoming or outgoing messages automatically converted to fun
audio character voices, thus increasing the enjoyment of the chatting activity. Optionally,
means of converting typical chat expressions (for example, "LOL" for "laughing out loud")
into an audio equivalent expression are also provided.
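Converting chat expressions into audio equivalents can be sketched as a token lookup that splits a message into spoken text and sound-effect cues. The `EXPRESSIONS` table and its cue names are hypothetical; in a real system each cue would select an entry from the character's expression or effects library.

```python
import re

# Hypothetical mapping from chat abbreviations to audio cues. Cues prefixed
# with "speech:" are spoken in full; the others name a sound effect.
EXPRESSIONS = {
    "lol": "laughter",
    "rofl": "long_laughter",
    "brb": "speech:be right back",
}

def to_audio_script(message):
    """Split a chat message into ('speak', text) and ('effect', cue) items."""
    script, words = [], []
    for token in message.split():
        key = re.sub(r"\W", "", token).lower()  # "LOL!!" -> "lol"
        cue = EXPRESSIONS.get(key)
        if cue is None:
            words.append(token)
            continue
        if words:  # flush the ordinary words spoken before the expression
            script.append(("speak", " ".join(words)))
            words = []
        if cue.startswith("speech:"):
            script.append(("speak", cue[len("speech:"):]))
        else:
            script.append(("effect", cue))
    if words:
        script.append(("speak", " ".join(words)))
    return script

print(to_audio_script("that was great LOL brb"))
```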

The voices in voice chat can also be modified to those of specific famous characters. Input from
a particular user can be entered either directly as text via the user's keyboard, or via a
speech recognition engine as part of an STS system as described elsewhere. The output audio
is streamed to all users in the chat room (who have character chat enabled) and is
synchronised with the text appearing from each of the users (if applicable).
A single user may select a character voice for all messages generated by himself;
in this scenario each chat user will speak in his/her own selected character voice.
Another scenario would allow the user to assign character voices from a set of available
voices to each of the users in the chat room. This would allow the user to listen to the chat
session in a variety of voices of his choosing, assigning each voice to each character
according to his whim. He/she would then also be able to change the voice assignments at
his/her leisure during the chat session.

The chat user may add background effects and embedded expressions, and perform other
special effects, on his or other voices in the chat room as he/she pleases.

The chat room may be a character-based system or a simulated 3D world with static or
animated avatars representing users within the chat room.

Chat rooms may be segmented based on character voice groupings rather than topic, age
or interests, as is common in chat rooms today. This would provide different themes for
different chat rooms (e.g. a Hollywood room populated by famous movie stars, a White
House room populated by famous political figures, etc.).

Consider the example of a chat session on the Internet in which you select the character
whose voice you want to be heard in. This includes the option of being heard as a
different character by different people. As a result, your chat partner hears you as, for
example, Elvis for every word and phrase you type, and you can change character as
many times as you like at the click of the mouse. Alternatively, your chat partner can
select how they want to hear you.
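The voice-assignment scenarios above (a speaker choosing one voice for everyone, a speaker choosing a different voice per listener, and a listener assigning voices to other chatters) can be resolved with a small lookup. The precedence order used here, in which a listener's own assignment wins, is an assumption; the class and method names are hypothetical.

```python
from typing import Dict, Optional, Tuple

class VoiceAssignments:
    """Resolve which character voice a listener hears for a given speaker."""

    def __init__(self, default_voice: str = "narrator"):
        self.default_voice = default_voice
        self.own_choice: Dict[str, str] = {}                        # speaker -> voice
        self.per_listener_choice: Dict[Tuple[str, str], str] = {}   # (speaker, listener) -> voice
        self.listener_override: Dict[Tuple[str, str], str] = {}     # (listener, speaker) -> voice

    def speaker_selects(self, speaker: str, voice: str, listener: Optional[str] = None):
        """A chatter picks his own voice, optionally only for one listener."""
        if listener is None:
            self.own_choice[speaker] = voice
        else:
            self.per_listener_choice[(speaker, listener)] = voice

    def listener_assigns(self, listener: str, speaker: str, voice: str):
        """A listener decides how a particular chatter should sound to him."""
        self.listener_override[(listener, speaker)] = voice

    def voice_for(self, speaker: str, listener: str) -> str:
        # Re-assigning any entry simply overwrites it, so changes take effect
        # immediately during the session.
        return (
            self.listener_override.get((listener, speaker))
            or self.per_listener_choice.get((speaker, listener))
            or self.own_choice.get(speaker)
            or self.default_voice
        )
```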

Voice enabling avatars in simulated environments

This application is very similar to 3D chat in that multiple computer-animated characters
are given the voice personalities of known characters. Users then design 3D animated
worlds/environments and dialogues between characters within these worlds.

For example, a user enters a 3D world by way of a purchased program or via
the Internet. Within this world, the user can create environments, houses, streets, etc. The
user can also create families and communities by selecting people and giving them
personalities. The user can apply specific character voices to individual people in the
simulated world and program them to have discussions with each other, or with others they
meet, in the voice of the selected character(s).

Interactive audio systems

A further feature adapts the system to work in conjunction with telephone answering
machines and voice mail systems to allow recording of the outgoing message (OGM)
contained within the answering machine or voice mail system. A user proceeds to cause
an audio format message in a specific character's voice to be generated by the server
means 10, for example, as previously described. Thereafter, the user is instructed on how
to configure his answering machine or voice mail system to receive the audio format
message and record it as the OGM.

The method may differ for different types of answering machines and telephone exchange
systems. For example, the server means 10 will preferably dial the user's answering
machine, send the audio signals specific to the codes required to set said
answering machine to OGM record mode, and then play the audio format message
previously created by said user over the connected telephone line, causing
the answering machine to record the audio format message as its OGM. Thereafter, when
a third party rings the answering machine, they will be greeted by a message of the user's
creation, recorded in the voice of a specific character or personality.
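Many answering machines accept remote commands as DTMF key tones, so one way to produce the audio for the sequence above is to generate the tones for the machine's remote-access code, pause, then play the message. The row/column frequencies below are the standard DTMF keypad frequencies. The remote code `"*9"`, the tone and gap lengths and the one-second pause are assumptions, since the required codes differ between machines, and placing the call itself is out of scope.

```python
import math

# Standard DTMF keypad: each key is the sum of one row tone and one column tone (Hz).
ROWS = {"123A": 697, "456B": 770, "789C": 852, "*0#D": 941}
COLS = {"147*": 1209, "2580": 1336, "369#": 1477, "ABCD": 1633}

def dtmf_pair(key):
    row = next((f for keys, f in ROWS.items() if key in keys), None)
    col = next((f for keys, f in COLS.items() if key in keys), None)
    if row is None or col is None:
        raise ValueError(f"not a DTMF key: {key!r}")
    return row, col

def dtmf_samples(keys, fs=8000, tone_ms=100, gap_ms=60, amplitude=0.5):
    """Float samples in [-1, 1] for a key sequence, each tone followed by silence."""
    tone_n, gap_n = int(fs * tone_ms / 1000), int(fs * gap_ms / 1000)
    out = []
    for key in keys:
        low, high = dtmf_pair(key)
        out.extend(
            amplitude * (math.sin(2 * math.pi * low * n / fs)
                         + math.sin(2 * math.pi * high * n / fs)) / 2
            for n in range(tone_n)
        )
        out.extend([0.0] * gap_n)
    return out

def ogm_programming_sequence(remote_code, message_samples, fs=8000):
    """Hypothetical call script: remote-code tones, a 1 s pause, then the OGM audio."""
    return dtmf_samples(remote_code, fs) + [0.0] * fs + list(message_samples)

sequence = ogm_programming_sequence("*9", [0.0] * 8000)  # placeholder message audio
```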



Interactive voice response systems

Various response systems are available in which an audio voice prompts the user to enter
particular keypad combinations to navigate through the available options provided by the
system. Embodiments can be provided in which the voice is that of a famous person,
based on a text message generated by the system. Similarly, information services (such
as weather forecasts) can be read in a selected character's voice.

Other navigation systems
Internet browsing can use character voices for the delivery of audio content. For example,
a user utilising a WAP-enabled telephone or other device (such as a personal digital
assistant) can navigate around a WAP application either by keypad or touch screen, or by
speaking into the microphone, at which point a speech recognition system is activated to
convert the speech to text, as previously described. These text commands are then
operated upon via the Internet to perform typical Internet activities (for example,
browsing, chatting, searching, banking etc.). During many of these operations, the
feedback to the user would be greatly enhanced if it was received in audio format and
preferably in a recognisable voice.

For such an application, the system can be applied to respond to requests for output to the
device. Equally, a system could be provided that enables a character voice TTS system to
be used in the above-defined way for delivering character voice messages over regular (i.e.
non-WAP-enabled) telephone networks.

Consider the example of a user who speaks into a WAP-enabled phone to select his
favourite search engine. He then speaks into his phone to tell the search engine what to
look for. The search engine then selects the best match and reads a summary of the Web
site to the user, producing speech in a character voice of the user's or the site owner's
selection by utilising the character voice TTS system.

Web navigation and Web authoring tools 

A Web site can be character-voice enabled such that certain information is presented to
the visitor in spoken audio form instead of, or as well as, the textual form. This
information can be used to introduce visitors to the Web site, help them navigate the Web
site and/or present static information (for example, advertising) or dynamic information
(for example, stock prices) to the visitor.



Software tools can be provided which allow a Webmaster to design character-voice-enabled
Web site features and publish these features on the World Wide Web. These tools
would provide collections of features and maintenance procedures. Example features
could include:
• Character voice training software

• Character voice database enhancement and maintenance software

• Text entry fields for immediate generation of voice audio files

• WYSIWYH (What You See Is What You Hear) SABLE markup assistance and
TTS robot placement and configuration tools

• Database connectivity tools to allow dynamic data to be generated for passing
to the TTS system 'on the fly'

• Tools for adding standard or custom user-interactive character voice features
to web pages (for example, a tool to allow a character voice chat site to be
included in the Webmaster's web page).



The WYSIWYH tool is the primary means by which a Webmaster can character-voice
enable a Web site. It operates similarly to, and optionally in conjunction with, other Web
authoring tools (for example, Microsoft FrontPage), allowing the Webmaster to gain
immediate access to the character voice TTS system to produce audio files, to mark up
sections of the web pages (for example, in SABLE) that will be delivered to the Internet
user in character voice audio format, to place and configure TTS robots within the Web
site, to link database searches to the TTS system and to configure CGI (or similar) scripts
to add character voice TTS functionality to the Web serving software.
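As an illustration of the SABLE markup step, the sketch below wraps page text in SABLE tags for a chosen character voice and emphasises selected words. The tag and attribute names (`SABLE`, `SPEAKER NAME`, `RATE SPEED`, `EMPH`) follow SABLE as supported by the Festival speech synthesis system; the speaker name `max_smart` is a hypothetical character-voice identifier, and the DOCTYPE header is omitted.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def to_sable(text, speaker, rate=None, emphasise=()):
    """Wrap page text in SABLE markup for a given character voice."""
    targets = {w.lower() for w in emphasise}
    words = []
    for word in text.split():
        marked = escape(word)  # e.g. "Q&A" -> "Q&amp;A"
        if word.strip(".,!?;:").lower() in targets:
            marked = f"<EMPH>{marked}</EMPH>"
        words.append(marked)
    body = " ".join(words)
    if rate is not None:
        body = f"<RATE SPEED={quoteattr(rate)}>{body}</RATE>"
    return f"<SABLE><SPEAKER NAME={quoteattr(speaker)}>{body}</SPEAKER></SABLE>"

def is_well_formed(markup):
    """Check the generated markup parses as XML before handing it to the TTS system."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

markup = to_sable("Welcome to our Q&A page, friends!", "max_smart",
                  rate="+20%", emphasise=["welcome"])
print(markup)
```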


TTS robots (or components) are interactive, Web-deliverable components which, when
activated by the user, allow him/her to interact with TTS-enabled applications.
For example, a Web page may include a TTS robot email box: when the user types
into the box and presses the enclosed send button, the message is delivered
to the TTS system and the audio file is automatically sent to the user's choice of
recipient. The WYSIWYH tool makes it easy for the Webmaster to add this feature to
his/her Web site.
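The TTS robot email box can be sketched as a small web handler. The text mentions CGI scripts; this sketch swaps in Python's WSGI interface (with `wsgiref` for a server-free local check) as a comparable mechanism. The form field names, `placeholder_tts` and the in-memory `OUTBOX` that stands in for mail delivery are all hypothetical.

```python
import io
from urllib.parse import parse_qs, urlencode
from wsgiref.util import setup_testing_defaults

def placeholder_tts(voice, text):
    """Hypothetical stand-in for the character voice TTS system."""
    return f"[{voice}] {text}".encode("utf-8")

OUTBOX = []  # stands in for the mail transport that delivers the audio file

def tts_robot_app(environ, start_response):
    """WSGI handler behind a 'TTS robot' email box form."""
    if environ.get("REQUEST_METHOD") != "POST":
        start_response("405 Method Not Allowed", [("Content-Type", "text/plain")])
        return [b"POST the form fields text, voice and recipient"]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    form = parse_qs(environ["wsgi.input"].read(length).decode("utf-8"))
    text = form.get("text", [""])[0]
    voice = form.get("voice", ["narrator"])[0]
    recipient = form.get("recipient", [""])[0]
    if not text or not recipient:
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"text and recipient are required"]
    OUTBOX.append((recipient, placeholder_tts(voice, text)))
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [f"Sent in the voice of {voice}".encode("utf-8")]

def post_form(app, fields):
    """Call the WSGI app directly with a urlencoded POST body (no server needed)."""
    body = urlencode(fields).encode("utf-8")
    environ = {}
    setup_testing_defaults(environ)
    environ["REQUEST_METHOD"] = "POST"
    environ["CONTENT_TYPE"] = "application/x-www-form-urlencoded"
    environ["CONTENT_LENGTH"] = str(len(body))
    environ["wsgi.input"] = io.BytesIO(body)
    status = []
    result = b"".join(app(environ, lambda s, headers: status.append(s)))
    return status[0], result
```

In deployment the same `tts_robot_app` could be served by any WSGI server; `post_form` exists only so the handler can be exercised locally.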

Note that the Internet link from the Web server to the character voice TTS system is
marked as optional. The character voice TTS system may be accessible locally from the
Web server (or may be purely software within the Web server or on an internal network),
or it may be remotely located on the Internet. In the latter case, all requests and responses
to other processes in this architecture will be routed via the Internet.
The WYSIWYH tool can also be used to configure a Web site to include other character
voice enabled features and navigation aids. These may include, for example:

• When you float over a button with the cursor, it 'speaks' the button function,
rather than displaying the normal text box.

• Character voices used in demo areas

• Advertising

• To automatically recommend a character voice based on a user's known
preferences: these could be asked for in a questionnaire or, with sites that store
historic data on users, these could be suggested (for example, if a person on
Amazon.com buys a lot of history books, it could recommend Winston Churchill
as the navigator). Alternatively, a character's voice can automatically be selected
for the user (for example, based on specific search criteria).

• To automatically create conversations between the user's preferred voice navigator
(for example, the user has software that automatically makes Homer Simpson his
navigator) and the selected navigator of the Web site (say, Max Smart), creating
an automatic conversation such as "Hey Homer, welcome to my site. It's Max Smart
here."
Consider the example of a Webmaster who updates a famous person's Web site daily with
new jokes and daily news by typing the text of the jokes and news into the WYSIWYH
tool. The Web server then serves up the audio voice of the famous person to each user
surfing the Web who selects this page. Conversion from text to speech can be performed
at preparation time and/or on demand for each user's request.






Consider the example of a famous person's Web site (a "techno" band or David
Letterman site, for example) which lets you "dialogue" with the famous person as if they
were there with you, all day and every day, when it is actually a text operator typing out
the return text message, which is converted to the famous person's voice at your end.

Now consider the example of a favourite sports Web site where a favourite sports
star gives you the commentary or latest news; you can then select another star and
listen to them, and then have Elvis do it for amusement.
Set top boxes and digital broadcasting

A set top box is the term given to an appliance that connects a television to the Internet
and usually also to the cable TV network. To assist in brand distinction, the audio
messages used to prompt a user during operation of such a device can be custom
generated from either an embedded character voice TTS system or a remotely located
character voice TTS system (connected via the Internet or cable network).

In a digital TV application, a user can select which characters they want to speak the news
or the weather and whether the voice will be, for example, soft, hard, shouting or
whispering.



Other applications 

Other applications incorporating embodiments of the invention include:

• Star chart readers

• Weather reports

• Character voice enabled comic strips

• Animated character voice enabled comic strips

• Talking alarm clocks, calendars, schedule programs etc.

• Multi-media presentations (for example, Microsoft PowerPoint slide
introductions)

• Talking books, either Web-based or based on MP3 handheld players or
other audio book devices

• Mouse tooltip annunciator






or other voice enabled applications, whereby the spoken messages are produced in the 
voice of a character, generally recognisable to the user. 



Client-server or embedded architectures

Some or all of the components of the system can be distributed as server or client
software in a networked or internetworked environment, and the split of functions between
server and client is arbitrary, being based on communications load, file size, compute power
etc. Additionally, the complete system may be contained within a single stand-alone
device which does not rely on a network for operation. In this case, the system can be
further refined to be embedded within a small appliance or other application with a
relatively small memory and computational footprint, for use in devices such as set-top
boxes, Net PCs, Internet appliances, mobile phones etc.

The most typical architecture is for all of the speech recognition (if applicable) to be
performed on the client, and for the TTS text message conversion requests to be sent over
the network (for example, the Internet) to be converted by one or more servers into audio
format voice messages for return to the client or for delivery to another client computer.

Construction of new character voices

The character TTS system can be enhanced to facilitate rapid addition of new voices for
different characters. Methods include: on-screen tuning tools to allow the speaker to
"tune" his voice to the required pitch and speed, suitable for generating or adding to the
recorded speech database; recording techniques suitable for storing the speech signal and
the laryngograph (EGG) signal; methods for automatically processing these signals;
methods for taking these processed signals and creating a recorded speech database for a
specific character's voice; and methods for including this recorded speech database into a
character TTS system.

Voice training and maintenance tools can be packaged for low-cost deployment on
desktop computers, or provided for rent via an Application Service Provider (ASP). This
allows a recorded speech database to be produced for use in a character voice TTS
system. The character voice TTS system can be packaged and provided for use on a
desktop computer or made available via the Internet in the manner described previously,
whereby the user's voice database is made available on an Internet server. Essentially,
any application, architecture or service provided as part of this embodiment could be
programmed to accept the user's new character voice.

As an example, the user buys from a shop or an on-line store a package which contains a
boom mike, a laryngograph, cables, a CD and headphones. After setting up the equipment
and testing it, the user runs the program on the CD, which guides the user through a
series of screen prompts, requesting him to read them aloud in a particular way (speed,
inflection, emotion etc.). When complete, the user instructs the software to create a
new 'voice font' of his own voice. He now has a resource (i.e. his own voice database)
that he can use with the invention to provide TTS services for any of the described
applications (for example, he could automatically voice-enable his Web site with daily
readings from his favourite on-line e-zine).

Further, this application allows a person to store his or her voice forever. Loved ones can
then have that person's voice read a new book to them, long after the original person has
passed away. As technology becomes more advanced, the voice quality will improve from
the same recorded voice database.

Method for recording audio and video together for use in animation

The process of recording the character reading usually involves the use of a closely
mounted boom microphone and a laryngograph. The laryngograph is a device that clips
around the speaker's throat and measures the vibration frequency of the larynx during
speech. This signal is used during development of the recorded speech database to
accurately locate the pitch markers (the individual glottal cycles) in the recorded voice
waveforms. It is possible to synchronously record a video signal of the speaker while the
audio signal and laryngograph signal are being recorded, and for this signal to be stored
within the database or cross-referenced and held within another database. The purpose of
this extra signal would be to provide facial cues for a TTS system that included a
computer-animated face. Additional information may be required during the recording,
such as would be obtained from sensors strategically placed on the speaker's face.
During TTS operation, this information could be used to provide an animated rendering
of the character speaking the words that are input into the TTS system.

In operation, when the TTS system retrieves recorded speech units from the recorded
speech database, it also retrieves the exact recorded visual information from the recorded
visual database that coincides with the selected speech unit. This information is then used
in one of two ways. In the first, each piece of video recording corresponding to the selected
units (in a unit selection speech synthesiser) is concatenated to form a video
signal of the character as if he/she were actually saying the text entered into the TTS
system. This has the drawback, however, that the video image of the character includes the
microphone, laryngograph and other unwanted artefacts. More practical is the inclusion of
a computer face animation module which uses only the motion capture elements of the
video signal to animate a computer-generated character which is programmed to look
stylistically similar or identical to the subject character.

Animation 

A further feature of certain embodiments involves providing a visual animation of a
virtual or physical representation of the character selected for the audio voice.
Preferably, a user could design, or by his agent cause to be designed, a graphical
simulation of said designed character. In toy-based embodiments, a user could produce, or
by his agent cause to be produced, accessories for said toy for attachment thereto, said
accessories being representative of said character. The graphical simulation or
accessorised toy can optionally perform the animated motions as previously described.

Animated characters (for example, Blaze) can be used to synchronise the voice or other
sound effects with the movement of the avatar (movement of the mouth or other body parts)
so that a recipient or user experiences a combined and synchronised image and sound
effect.

In the toy embodiment, the toy may optionally have electromechanical mechanisms for
performing animation of moving parts of the toy during the replay of recorded messages.
The toy has a number of mechanically actuated lugs for the connection of accessories.
Optionally, the accessories represent stylised body parts, such as eyes, hat, mouth, ears
etc., or stylised personal accessories, such as musical instruments, glasses, handbags etc.

The accessories can be designed in such a way that the arrangement of all of the accessories
upon the said lugs of the toy's body makes the toy as a whole a visual representation
of a specific character or personality (for example, Elvis Presley). Preferably, the lugs to
which accessories are attached perform reciprocation or other more complex motions
during playback of the recorded message. This motion can be synchronised with the
tempo of the spoken words of the message.

Optionally, the accessories may themselves comprise mechanical assemblies such
that the reciprocation or other motion of the lugs of the toy causes the actuation of more
complex motions within the accessory itself. For example, an arm holding a teapot
accessory may be designed with an internal mechanism of gears, levers and other
mechanisms such that, upon reciprocation of its connecting lug, the hand moves up, then
out while rotating the teapot, then retracts straight back to its rest position. Another
example is an accessory which has a periscope comprising gears, levers and a concertina
lever mechanism that, upon reciprocation of its connecting lug, causes the periscope to
extend markedly upwards, rotate 90 degrees, rotate back, then retract to its rest position.
Various other arrangements are of course possible.

In embodiments, two- or three-dimensional computer graphic characters may optionally
be animated in time with the spoken audio format message in a manner which provides
the impression that the animated character is speaking the audio format message. More
complex animation sequences can also be provided.

In toy embodiments, the lug or lugs which relate to the mouth accessory are actuated so
that the mouth is opened near the beginning of each spoken word and closed near the end
of each spoken word, thus providing the impression that the toy is actually speaking the
audio format message.
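One way for the toy's controller to schedule the mouth lug is to detect spoken words as runs of audio energy, then open slightly before each run begins and close slightly before it ends. This is a sketch under assumptions: energy runs only approximate word boundaries, and the 20 ms frame, 0.05 RMS threshold, 60 ms gap bridging and 20 ms lead are illustrative values rather than figures from the text.

```python
import math

def speech_runs(samples, fs, frame_ms=20, threshold=0.05, min_gap_ms=60):
    """Find (start_s, end_s) spans where the short-term RMS level exceeds a threshold.

    Samples are floats in [-1, 1]. Gaps shorter than min_gap_ms are bridged so that
    brief dips inside a word do not close the mouth.
    """
    frame = max(1, int(fs * frame_ms / 1000))
    active = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        active.append(rms >= threshold)

    runs, start = [], None
    for idx, on in enumerate(active):
        if on and start is None:
            start = idx
        elif not on and start is not None:
            runs.append([start, idx])
            start = None
    if start is not None:
        runs.append([start, len(active)])

    min_gap = max(1, int(min_gap_ms / frame_ms))
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1][1] = run[1]  # bridge a short pause within a word
        else:
            merged.append(run)
    return [(s * frame / fs, e * frame / fs) for s, e in merged]

def mouth_events(runs, lead_s=0.02):
    """Open slightly before each word starts and close slightly before it ends."""
    events = []
    for start, end in runs:
        events.append((max(0.0, start - lead_s), "open"))
        events.append((max(start, end - lead_s), "close"))
    return events
```

Opening a little early compensates for the mechanical lag of the lug, which is one plausible reason the text says "near" the beginning and end of each word.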

The other lugs on the toy can be actuated in some predefined sequence or pseudo-random
sequence relative to the motion of the mouth, this actuation being performed by way of
levers, gears and other mechanical mechanisms. A further feature allows for a more
elaborate electromechanical design whereby a plurality of electromechanical actuators are
located around the toy's mouth and eyes region, said actuators being independently
controlled to allow the toy to form complex facial expressions during the replay of an
audio format message.

A second channel of a stereo audio input cable connecting the toy to the computer can be
used to synchronously record the audio format message and the sequence of facial and
other motions that relate to the audio format message.

Toy embodiment specific aspects

Shown in Figure 12 is a toy 70 that may be connectable to a computing means 72 via a
connection means 74 through a link 76, which may be wireless (and therefore connected to
a network) or a fixed cable. The toy 70 has a non-volatile memory 71 and a controller
means 75. An audio message may be downloaded through various software to the
computing means 72, via the Internet for example, and subsequently transferred to the toy
through the connection means 74.

A number of features specific to toy-based embodiments are now described. In one feature the audio format message remains in non-volatile memory 71 within the toy 70 and can be replayed many times until the user instructs the microprocessor in the toy, by way of the controller means 75, to erase the message from the toy. Preferably, the toy is capable of storing multiple audio format messages and replaying any of these messages by operation of the controller means 75. Optionally, the toy may automatically remove old messages from the non-volatile memory 71 when there is insufficient space to record an incoming message.

A further feature provides that when an audio format message is transmitted from the software to the user's computer processor means 72 and subsequently transferred to the toy 70 by way of the connecting means 74, the message may optionally be encrypted by the software and then decrypted by the toy 70 to prevent users from listening to the message prior to replay of the message on the toy 70. This encryption can be performed by reversing the time sequence of the audio format message, with decryption being performed by reversing the order of the stored audio format message in the toy. Of course, any other suitable form of encryption may be used.
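Time reversal is its own inverse, so the same operation both scrambles and restores the message. A minimal sketch, assuming the message is held as 16-bit signed PCM samples; reversal must be applied per sample rather than per raw byte, or the byte order inside each sample would be swapped. Reversal only stops casual listening and offers no cryptographic strength, which is why other forms of encryption remain open.

```python
from array import array


def scramble(samples):
    """'Encrypt' by reversing the time sequence of the audio samples."""
    return samples[::-1]


def unscramble(samples):
    """'Decrypt' on the toy by reversing the stored order again."""
    return samples[::-1]


if __name__ == "__main__":
    pcm = array("h", [0, 1200, -800, 300])  # 16-bit signed samples
    hidden = scramble(pcm)
    print(list(hidden))  # [300, -800, 1200, 0]
    print(unscramble(hidden) == pcm)  # True
```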

Another feature provides that when an audio format message is transmitted from the software to the computing processor 72 and subsequently transferred to the toy 70 by way of the connecting means 74, the message may optionally be compressed by the software and then decompressed by the toy 70, whether or not the audio format message is encrypted. The reason for this compression is to speed up the recording process of the toy 70. In a preferred embodiment, this compression is performed by sampling the audio format message at an increased rate when transferring the audio format message to the toy 70, thus reducing the transfer time. The toy subsequently, preferably, interpolates between samples to recreate an approximation of the original audio format message. Other forms of analog audio compression can be used as appropriate.

In another feature, the toy 70 is optionally fitted with a motion sensor to detect motion of people within the toy's proximity, and the software resident in the toy is adapted to replay one or a plurality of stored audio format messages upon detection of motion in the vicinity of the toy. Preferably, the user can operate the controller means 75 on the toy to select which stored message or sequence of stored messages will be replayed upon the detection of motion. Alternatively, the user may use the controller means 75 to set the toy to replay a random message from a selection of stored messages upon each detection of motion, or at fixed or random periods of time following the first detection of motion, for a period of time. The user may optionally choose from a selection of "wise-cracks" or other audio format messages stored on the Internet server computers for use with the toy's motion sensing feature. An example wise-crack would be "Hey ... over here. Did you ask to enter my room?"
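The two replay policies (a user-chosen message or sequence, or a random pick per detection) and the optional follow-up replays within a time window can be sketched as follows. `MotionResponder`, its `mode` values and the seconds-based window are hypothetical stand-ins for settings made with the controller means 75.

```python
import random


class MotionResponder:
    """Chooses which stored messages to replay when the motion sensor fires.

    `mode` and `selection` stand in for settings made with the controller
    means 75; both names are hypothetical.
    """

    def __init__(self, mode, selection, rng=None):
        if mode not in ("sequence", "random"):
            raise ValueError("mode must be 'sequence' or 'random'")
        if not selection:
            raise ValueError("selection must not be empty")
        self.mode = mode
        self.selection = list(selection)
        self.rng = rng or random.Random()

    def on_motion(self):
        """Message ids to replay for one detection event."""
        if self.mode == "sequence":
            return list(self.selection)  # the user's chosen message or sequence
        return [self.rng.choice(self.selection)]  # one random message

    def follow_up_times(self, first_detection_s, window_s, count):
        """Random replay times within `window_s` seconds after first detection."""
        return sorted(first_detection_s + self.rng.uniform(0, window_s)
                      for _ in range(count))
```

Passing a seeded `random.Random` makes the random mode reproducible for testing; on the toy an unseeded generator would keep the wise-cracks unpredictable.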


A further feature allows two toys to communicate directly with each other without the aid of a compatible computer or Internet connection. A first toy is provided with a headphone socket to enable a second toy to be connected to the first toy by plugging the audio input cable of the second toy into the headphone socket of the first toy. The user of the second toy then preferably selects and plays an audio format message stored in the second toy by operating the controlling means on the second toy. The first toy then detects the incoming audio format message from the second toy in the same manner as if said message had been transmitted by a compatible computer. This allows toy users to exchange audio format messages without requiring connection to compatible computers.

Gift giving process 

A further feature relates to a novel way of purchasing a toy product online (such as over the Internet) as a gift. The product is selected, the shipping address is entered, and the billing address, payment details and a personalised greeting message are entered in a manner similar to regular online purchases. Thereafter, upon shipping of the product to the recipient of the gift, instead of printing the giver's personal greeting message (for example, 'Happy birthday Richard, I thought this Elmer Fudd character would appeal to your sense of humour. From Peter') upon a card or gift certificate to accompany the gift, said greeting message is preferably stored in a database on the Internet server computer(s).

The recipient receives a card with the shipment of the toy product, containing instructions on how to use the Web to receive his personalised greeting message. The recipient then preferably connects his toy product to a compatible computer using the toy product's connecting means and enters the Uniform Resource Locator (URL) printed on said card into the browser on his compatible computer. This results in the automatic download and transfer to the recipient's toy product of an audio format message representing the giver's personal greeting message, spoken in the voice of the character represented by the stylistic design of the received toy product.

The recipient can operate controlling means on the toy product to play said audio format message.
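The server side of the gift process amounts to storing the greeting at checkout under an unguessable key and printing a URL containing that key on the card. The sketch below assumes an in-memory dictionary in place of the server database, and `https://example.com/greeting/` is a hypothetical address; converting the fetched text into the character's voice (the character TTS step) is outside this sketch.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class GiftGreeting:
    giver: str
    recipient: str
    text: str
    character: str  # voice matching the stylistic design of the shipped toy


class GreetingStore:
    """In-memory stand-in for the greeting database on the Internet server."""

    BASE_URL = "https://example.com/greeting/"  # hypothetical address

    def __init__(self):
        self._db = {}

    def save(self, greeting):
        """Store a greeting at checkout; return the URL to print on the card."""
        token = secrets.token_urlsafe(16)  # unguessable, so cards can't be enumerated
        self._db[token] = greeting
        return self.BASE_URL + token

    def fetch(self, url):
        """Look up the greeting when the recipient enters the card's URL."""
        if not url.startswith(self.BASE_URL):
            raise KeyError(url)
        return self._db[url[len(self.BASE_URL):]]
```

Using a random token rather than an order number means one recipient cannot guess the URL of another person's greeting.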


Multiple users 

While the embodiments described herein are generally in relation to one or two users, they can of course be readily extended to encompass any number of users who are able to interact with the Web site, the Web software, character TTS, character TVS and, in the toy embodiment, multiple toys as appropriate.

Also, multiple toy styles or virtual computer graphic characters may be produced, whereby each style is visually representative of a different character. Example characters include real persons alive or deceased, or characterisations of real persons (for example, television characters), cartoon or comic characters, computer animated characters, fictitious characters or any other form of character that has an audible voice. Further, the stylisation of a toy can be achieved by modification of form, shape, colour and/or texture of the body of the toy. Interchangeable kits of clip-on body parts can be provided to be added to the toy's lugs or other fixed connection points on the body of the toy.

A further feature allows users of a toy embodiment to upgrade the toy to represent a new character without the need to purchase physical parts (for example, accessories) for fixation to the toy. The body of the toy and its accessories are designed with regions adapted to receive printed labels, wherein said labels are printed in such a manner as to be representative of the appearance of a specific character and said character's accessories. The labels are preferably replaceable, wherein new labels for, say, a new character can preferably be virtually downloaded via the Internet or otherwise obtained. The labels are visually representative of the new character. The labels are subsequently converted from virtual form to physical form by printing the labels on a computer printer attached to or otherwise accessible from said user's compatible computer.

Many voices 

In any of the example applications, typically the use of one voice is described. However, the same principles can be applied to cover multiple voices at one time, and ...

It will be understood that the invention disclosed and defined in this specification extends to all alternative combinations of two or more of the individual features mentioned or evident from the text or drawings. All of these different combinations constitute various alternative aspects of the invention.



WO 01/57851    PCT/AU01/00111

CLAIMS: 

1. A method of generating an audio message, comprising the steps of:
providing a text-based message; and
generating said audio message based on said text-based message;
wherein said audio message is at least partly in a voice which is representative of a character generally recognisable to a user.

2. A method according to claim 1 wherein said character is selected from a predefined list of characters, each character in said list being generally recognisable to a user.

3. A method according to claim 1 or claim 2 wherein said generating step uses a textual or encoded database which indexes speech units with corresponding audio recordings representing said speech units.

4. A method according to claim 1 or claim 2 wherein said generating step comprises concatenating together one or more audio recordings of speech units, the sequence of the concatenated audio recordings being determined with reference to indexed speech units associated with one or more of the audio recordings in said sequence.

5. A method according to claim 3 further comprising the step of substituting words in said text-based message that do not have corresponding audio recordings of suitable speech units with substitute words that do have corresponding audio recordings.

6. A method according to any one of claims 3 to 5, wherein said speech units represent any one or more of the following: words, phones, sub-phones, multi-phone segments of speech.
7. A method according to any one of claims 3 to 6 wherein said speech units cover the phonetic and prosodic range required to generate said audio message.

8. A method according to claim 5 wherein the substituted words are replaced with supported words that each have suitable associated audio recordings.

9. A method according to any one of the previous claims wherein after the step of providing said text-based message further including the step of converting said text-based message into a corresponding text-based message which is used as the basis for generating said audio message.

10. A method according to claim 9 wherein said step of converting said text-based message to a corresponding text-based message includes substituting said original text-based message with a corresponding text-based message which is an idiomatic representation of said original text-based message.

11. A method according to claim 10 wherein said corresponding text-based message is in an idiom which is attributable to, associated with or at least compatible with said character.

12. A method according to claim 10 wherein said corresponding text-based message is in an idiom which is intentionally incompatible with said character or attributable to or associated with a different character which is generally recognisable by a user.

13. A method according to any one of the previous claims wherein said audio message is generated in multiple voices, each voice representative of a different character which is generally recognisable to a user.

14. A method according to any one of claims 1 to 8 wherein after the step of providing said text-based message further including the step of converting only a portion of said text-based message into a corresponding text-based message which is an idiomatic representation of the original text-based message.

15. A method according to any one of the previous claims wherein said generation of said audio message includes randomly inserting particular vocal expressions or sound effects between certain predetermined audio recordings from which ...

16. A method according to any one of the previous claims wherein said text-based message is generated from an initial audio message from said user using voice recognition and subsequently used as the basis for generating said audio message in a voice representative of a generally recognisable character.

17. A method according to any one of the previous claims further comprising the step of said user applying one or more audio effects to said audio message.

18. A method according to claim 17 wherein said one or more audio effects includes changing the sound characteristics of said audio message.

19. A method according to claim 17 wherein said one or more audio effects includes background sound effects to give the impression that the voice of the character emanates from a particular environment.

20. A system for generating an audio message comprising:
means for providing a text-based message;
means for generating said audio message based on said text-based message;
wherein said audio message is at least partly in a voice which is representative of a character generally recognisable to a user.

21. A system according to claim 20 further comprising storage means for indexing speech units with corresponding audio recordings representing said speech units.
22. A system according to claim 21 wherein said audio message is generated by concatenating together one or more audio recordings of speech units, the sequence of the concatenated audio recordings being determined with reference to said indexed speech units associated with one or more of the audio recordings in the sequence.

23. A system according to any one of claims 20 to 22 wherein words or expressions in said text-based message that do not have corresponding audio recordings of suitable speech units are substituted with substitute words or substitute expressions that do have corresponding audio recordings.

24. A method according to any one of claims 21 to 23 wherein said speech units represent any one or more of the following: words, phones, sub-phones, multi-phone segments of speech.

25. A method according to any one of claims 21 to 24 wherein said speech units 
cover the phonetic and prosodic range required to generate said audio message. 

26. A system according to claim 23 wherein each substituted word or expression has a closely similar grammatical meaning to the original word or expression in the context of the text-based message.

27. A system according to claim 23 further comprising thesaurus means for indexing said words or expressions in said text-based message with said substitute words or said substitute expressions.

28. A system according to claim 27 wherein said words or expressions are substituted with substitute words or expressions that have associated audio recordings.
29. A system according to any one of claims 20 to 28 wherein said text-based message is provided by said user.

30. A system according to claim 29 wherein said means for providing is a computing processor such that said user constructs said text-based message using said computing processor and using text-based elements such as words, expressions etc selected from a predefined list.

31. A system according to claim 30 wherein said list includes vocal expressions attributable to, associated with or at least compatible with said character.

32. A system according to claim 30 or claim 31 wherein each text-based element is represented in said text-based message by a specific code representative of the respective text-based element.

33. A system according to claim 32 wherein said representation is achieved by using a preliminary escape code sequence followed by the code representing said text-based element.

34. A system according to claim 30 wherein one or more templates are displayed on said computing processor, said one or more templates depicting fields providing one or more options selectable by said user to create said audio message.

35. A system according to claim 34 wherein said fields include the user's name, the recipient's name, the type of message and style of message.

36. A system according to claim 34 or claim 35 wherein said fields include the 
voice of said character in which said audio message is to be spoken, audio effects 
and time of delivery. 


37. A system according to claim 34 wherein said fields depict phrases or audio 
effects each forming a portion of said audio message. 
38. A system according to claim 29 wherein said text-based message provided by said user has natural language input from said user which is accepted and processed by a message processing means, said message processing means thereafter determining a text outcome for said input and constructing said audio message based on said text outcome.

39. A system according to claim 29 wherein said text-based message provided by said user has constrained language input from said user that is accepted and processed by a message processing means, said message processing means thereafter determining a text outcome for said input and constructing said audio message based on said text outcome.

40. A system according to any one of claims 20 to 33 wherein the generated audio message has one or more audio effects stored in said storage means.

41. A system according to claim 21 wherein said storage means censors unsuitable words for use in the generated audio message.

42. A system according to any one of claims 20 to 41 further comprising voice recognition means such that said user utters an audio message which is converted by said speech recognition means into said text-based message.

43. A system for generating an audio message using a communications network, said system comprising:
means for providing a text-based message linked to said communications network;
means for generating said audio message based on said text-based message;
wherein said audio message is at least partly in a voice which is representative of a character generally recognisable to a user.
44. A system according to claim 43 wherein said means for providing a text-based message is a computing processor having data entry means for said user to enter the text-based message.

45. A system according to claim 43 or 44 wherein said generating means is a server linked to said communications network that converts said text-based message into said audio message.

46. A system according to claim 45 further comprising storage means for indexing speech units with corresponding audio recordings representing said speech units.

47. A system according to claim 46 wherein said audio message is generated by concatenating together one or more audio recordings of speech units, the sequence of the concatenated audio recordings being determined with reference to said indexed speech units associated with one or more of the audio recordings in the sequence.

48. A system according to claim 46 wherein said server accesses said storage means to construct said audio message at least partly in a voice which is representative of a character generally recognisable to said user.

49. A system according to claim 48 wherein said storage means stores audio 
recordings of characters generally recognisable to users of the system. 


50. A system according to any one of claims 45 to 49 wherein after constructing said audio message said server transmits the audio message to the intended recipient over said communications network.

51. A system according to any one of claims 43 to 50 further comprising voice recognition means for converting an audio message of said user into said text-based message.




52. A system according to any one of claims 43 to 51 wherein said audio message is generated with visual images of the character in whose voice the audio message is provided.

53. A system according to claim 52 wherein said audio message and said visual images are synchronised whereby the facial expressions of the character reflect the sequence of words, expressions and other aural elements spoken by said character.

54. A system according to any one of claims 44 to 53 wherein said computing processor is a mobile terminal linked to said communications network through a further communications network such as a cellular network.

55. A system according to claim 54 wherein an audio message is input by said user to said mobile terminal which is converted to a text-based message.

56. A system according to claim 54 wherein a text-based message is input by said user to said mobile terminal from which said audio message is generated.

57. A system according to any one of claims 54 to 56 wherein said communications network is the Internet and said mobile terminal is WAP-enabled.

58. A toy comprising:
speaker means for playback of an audio signal;
memory means to store a text-based message; and
controller means operatively connecting said memory means and said speaker means for generating said audio signal for playback by said speaker means;
wherein said controller means, in use, generates an audio message which is at least partly in a voice representative of a character generally recognisable to a user.
59. A toy according to claim 58 wherein said controller means is operatively connected with a connection means that allows said toy to communicate with a computing device.

60. A toy according to claim 59 wherein said computing device is a computer connected to said toy by a cable via said connection means.

61. A toy according to claim 59 wherein said connection means is adapted to provide a wireless connection either directly to a computer or through a communications network.

62. A toy according to any one of claims 58 to 61 wherein said connection means allows text-based messages, such as email, or recorded audio messages to be provided to said toy for playback through said speaker means.

63. A toy according to any one of claims 58 to 61 wherein said connection means allows an audio signal to be provided directly to said speaker means for playback of an audio message.

64. A toy according to any one of claims 58 to 63 wherein said toy has the form of said character.

65. A toy according to claim 64 wherein said toy is adapted to move its mouth and/or other facial or bodily features in response to said audio message.

66. A toy according to claim 64 wherein the movement of said toy is synchronised with predetermined speech events of said audio message.

67. A toy according to any one of claims 58 to 66 wherein said toy is an 
electronic handheld toy having microprocessor-based controller means and a non- 
volatile memory means. 
68. A toy according to any one of claims 58 to 67 having means to allow for recording and playback of audio.

69. A toy according to claim 68 wherein audio recorded by said toy is converted to a text-based message which is then used to generate an audio message based on said text-based message, said audio message spoken in a voice of a generally recognisable character.

70. A toy comprising:
speaker means for playback of an audio signal;
memory means to store an audio message; and
controller means operatively connecting said memory means and said speaker means for generating said audio signal for playback by said speaker means;
wherein said controller means, in use, generates said audio message which is at least partly in a voice representative of a character generally recognisable to a user.

71. A toy according to claim 70 wherein said controller means is operatively connected with a connection means that allows said toy to communicate with a computing device, said computing device being connected to said toy through said connection means.

72. A toy according to claim 71 wherein said computing device converts a text-based message into said audio message for storage in said memory means.

73. A system for generating an audio message which is at least partly in a voice representative of a character generally recognisable to a user, said system comprising:
means for transmitting a message request over a communications network;
message processing means for receiving said message request;
wherein said processing means processes said message request, constructs said audio message that is at least partly in a voice representative of a character generally recognisable to a user, and forwards the constructed audio message over said communications network to one or more recipients.

74. A system according to claim 73 wherein said message request includes a sender audio message and said message processing means constructs said audio message based on said sender audio message.

75. A system according to claim 73 or claim 74 further comprising first data storage means linked to said message processing means to enable said message processing means access to said first data storage means to construct said audio message, said first data storage means storing character audio recordings of one or more characters generally recognisable to said user.

76. A system according to claim 74 or claim 75 wherein said message processing means directs said user to provide responses to an interactive voice response system as part of said sender audio message.

77. A system according to any one of claims 74 to 76 wherein said message processing means accepts natural language input from said user, processes said natural language input, determines a text outcome for said input and constructs said audio message based on said text outcome.

78. A system according to any one of claims 74 to 76 wherein said message processing means has a speech interface for accepting constrained language user input via automated voice prompts, processing said constrained language user input, determining a text outcome for said constrained language user input and constructing said audio message based on said text outcome.
79. A system according to any one of claims 73 to 78 further comprising second data storage means linked to said message processing means for storing audio recordings of sound effects for insertion into said audio message.

80. A system according to any one of claims 74 to 79 further comprising a first database storing ... phrases for use in constructing said audio message.

81. A system according to any one of claims 74 to 80 further comprising a corrections database for inserting speech portions into said audio message to correct or replace original speech portions of said message request.

82. A method for generating an audio message which is at least partly in a voice representative of a character generally recognisable to a user, said method comprising the following steps:
transmitting a message request over a communications network;
processing said message request and constructing said audio message at least partly in a voice representative of a character generally recognisable to a user; and
forwarding the constructed audio message over said communications network to one or more recipients.

83. A method of generating an audio message, comprising the steps of:
providing a request to generate said audio message in a predetermined format;
generating said audio message based on said request;
wherein said audio message is at least partly in a voice which is representative of a character generally recognisable to a user.

84. A computer program element comprising computer program code means to control a processing means to execute a procedure for generating an audio message according to the method of any one of claims 1 to 19, claim 82 or claim 83.
85. A computer readable memory, encoded with data representing a computer program for directing a processing means to execute a procedure for generating an audio message according to the method of any one of claims 1 to 19, claim 82 or claim 83.